Preface


This text is the eighth edition of Essentials of Modern Business Statistics with Microsoft®

Excel®. With this edition we welcome two eminent scholars to our author team: 
Michael J. Fry of the University of Cincinnati and Jeffrey W. Ohlmann of the University of 
Iowa. Both Mike and Jeff are accomplished teachers, researchers, and practitioners in the 
fields of statistics and business analytics. You can read more about their accomplishments in 
the About the Authors section that follows this preface. We believe that the addition of Mike 
and Jeff as our coauthors will both maintain and improve the effectiveness of Essentials of 
Modern Business Statistics with Microsoft Excel. 

The purpose of Essentials of Modern Business Statistics with Microsoft Excel is to give 
students, primarily those in the fields of business administration and economics, a concep- 
tual introduction to the field of statistics and its many applications. The text is applications 
oriented and written with the needs of the nonmathematician in mind; the mathematical 
prerequisite is knowledge of algebra. 

Applications of data analysis and statistical methodology are an integral part of the orga- 
nization and presentation of the text material. The discussion and development of each tech- 
nique is presented in an applications setting, with the statistical results providing insights to 
decisions and solutions to applied problems. 

Although the book is applications oriented, we have taken care to provide sound meth- 
odological development and to use notation that is generally accepted for the topic being 
covered. Hence, students will find that this text provides good preparation for the study of 
more advanced statistical material. A bibliography to guide further study is included as an 
appendix. 


Use of Microsoft Excel for Statistical Analysis 


Essentials of Modern Business Statistics with Microsoft Excel is first and foremost a statis- 
tics textbook that emphasizes statistical concepts and applications. But since most practical 
problems are too large to be solved using hand calculations, some type of statistical software 
package is required to solve these problems. There are several excellent statistical packages 
available today. However, because most students and potential employers value spreadsheet 
experience, many schools now use a spreadsheet package in their statistics courses. Micro- 
soft Excel is the most widely used spreadsheet package in business as well as in colleges and 
universities. We have written Essentials of Modern Business Statistics with Microsoft Excel 
especially for statistics courses in which Microsoft Excel is used as the software package. 

Excel has been integrated within each of the chapters and plays an integral part in pro- 
viding an application orientation. Although we assume that readers using this text are 
familiar with Excel basics such as selecting cells, entering formulas, and copying, we do
not assume that readers are familiar with Excel’s tools for statistical analysis. As
a result, we have included Appendix E, which provides an introduction to Excel and tools 
for statistical analysis. 

Throughout the text the discussion of using Excel to perform a statistical procedure ap- 
pears in a subsection immediately following the discussion of the statistical procedure. We 
believe that this style enables us to fully integrate the use of Excel throughout the text, but 
still maintain the primary emphasis on the statistical methodology being discussed. In each 
of these subsections, we use a standard format for using Excel for statistical analysis. There 
are four primary tasks: Enter/Access Data, Enter Functions and Formulas, Apply Tools, and 
Editing Options. We believe a consistent framework for applying Excel helps users to focus 
on the statistical methodology without getting bogged down in the details of using Excel. 

In presenting worksheet figures we often use a nested approach in which the worksheet 
shown in the background of the figure displays the formulas and the worksheet shown in the 
foreground shows the values computed using the formulas. Different colors and shades of 
colors are used to differentiate worksheet cells containing data, highlight cells containing 
Excel functions and formulas, and highlight material printed by Excel as a result of using
one or more data analysis tools. 


Changes in the Eighth Edition 


We appreciate the acceptance and positive response to the previous editions of Essentials of 
Modern Business Statistics with Microsoft Excel. Accordingly, in making modifications for 
this new edition, we have maintained the presentation style and readability of those editions. 
The significant changes in the new edition are summarized here. 


• Software. In addition to step-by-step instructions and screen captures that show
how to use the latest version of Excel to implement statistical procedures, we also 
provide instructions for Excel Online and R through the MindTap Reader. 


• New Examples and Exercises Based on Real Data. In this edition, we have added
headers to all Applications exercises to make the application of each exercise
clearer. We have also added over 160 new examples and exercises based on real data
and referenced sources. By using data from sources also used by The Wall Street 
Journal, USA Today, The Financial Times, Forbes, and others, we have drawn from 
actual studies and applications to develop explanations and create exercises that 
demonstrate the many uses of statistics in business and economics. We believe 
that the use of real data from interesting and relevant problems generates greater 
student interest in the material and enables the student to more effectively learn 
about both statistical methodology and its application. 


• Case Problems. We have added four new case problems to this edition. The 47 case
problems in the text provide students with the opportunity to analyze somewhat 
larger data sets and prepare managerial reports based on the results of their analysis. 


• Appendixes for Use of R. We now provide appendixes in the MindTap Reader for
many chapters that demonstrate the use of the popular open-source software R and 
RStudio for statistical applications. The use of R is not required to solve any prob- 
lems or cases in the textbook, but the appendixes provide an introduction to R and 
RStudio for interested instructors and students. 


Features and Pedagogy 


Authors Anderson, Sweeney, Williams, Camm, Cochran, Fry, and Ohlmann have continued 
many of the features that appeared in previous editions. Important ones for students are 
noted here. 


Methods Exercises and Applications Exercises 


The end-of-section exercises are split into two parts, Methods and Applications. The Meth- 
ods exercises require students to use the formulas and make the necessary computations. The 
Applications exercises require students to use the chapter material in real-world situations. 
Thus, students first focus on the computational “nuts and bolts” and then move on to the 
subtleties of statistical application and interpretation. 


Margin Annotations and Notes and Comments 


Margin annotations that highlight key points and provide additional insights for the student 
are a key feature of this text. These annotations, which appear in the margins, are designed 
to provide emphasis and enhance understanding of the terms and concepts being presented 
in the text. 

At the end of many sections, we provide Notes and Comments designed to give the stu- 
dent additional insights about the statistical methodology and its application. Notes and 
Comments include warnings about or limitations of the methodology, recommendations for 
application, brief descriptions of additional technical considerations, and other matters. 
Data Files Accompany the Text


Over 250 data files are available on the website that accompanies the text. DATAfile logos 
are used in the text to identify the data sets that are available on the website. Data sets for all 
case problems as well as data sets for larger exercises are included. 


MindTap 


MindTap, featuring all new Excel Online integration powered by Microsoft, is a complete 
digital solution for the business statistics course. It has enhancements that take students from 
learning basic statistical concepts to actively engaging in critical thinking applications, while 
learning valuable software skills for their future careers. The R appendixes for many of the 
chapters in the text are also accessible through MindTap. 

MindTap is a customizable digital course solution that includes an interactive eBook and 
autograded, algorithmic exercises from the textbook. All of these materials give students
better access to, and a better understanding of, the course content. For more information on
MindTap, please contact your Cengage representative.


For Students 


Online resources are available to help the student work more efficiently. The resources can 
be accessed at www.cengage.com/decisionsciences/anderson/embs/8e. 


For Instructors 


Instructor resources are available to adopters on the Instructor Companion Site, which can 
be found and accessed at www.cengage.com/decisionsciences/anderson/embs/8e, including: 


• Solutions Manual: The Solutions Manual, prepared by the authors, includes
solutions for all problems in the text. It is available online as well as in print.

• Solutions to Case Problems: These are also prepared by the authors and contain
solutions to all case problems presented in the text.

• PowerPoint Presentation Slides: The presentation slides contain a teaching
outline that incorporates figures to complement instructor lectures.

• Test Bank: Cengage Learning Testing Powered by Cognero is a flexible, online
system that allows you to:

    • author, edit, and manage test bank content from multiple Cengage Learning
      solutions,
    • create multiple test versions in an instant, and
    • deliver tests from your LMS, your classroom, or wherever you want.
STATISTICS IN PRACTICE


Bloomberg Businessweek* 


NEW YORK, NEW YORK


Bloomberg Businessweek is one of the most widely
read business magazines in the world. Along with
feature articles on current topics, the magazine contains
articles on international business, economic analysis,
information processing, and science and technology.
Information in the feature articles and the regular sections
helps readers stay abreast of current developments and
assess the impact of those developments on business
and economic conditions.

Most issues of Bloomberg Businessweek provide an
in-depth report on a topic of current interest. Often, the
in-depth reports contain statistical facts and summaries
that help the reader understand the business and economic
information. Examples of articles and reports include
the impact of businesses moving important work
to cloud computing, the crisis facing the U.S. Postal
Service, and why the debt crisis is even worse than we
think. In addition, Bloomberg Businessweek provides a
variety of statistics about the state of the economy,
including production indexes, stock prices, mutual funds,
and interest rates.

Bloomberg Businessweek also uses statistics and
statistical information in managing its own business.
For example, an annual survey of subscribers helps
the company learn about subscriber demographics,
reading habits, likely purchases, lifestyles, and so on.
Bloomberg Businessweek managers use statistical
summaries from the survey to provide better services
to subscribers and advertisers. One North American
subscriber survey indicated that 64% of Bloomberg
Businessweek subscribers are involved with computer
purchases at work. Such statistics alert Bloomberg
Businessweek managers to subscriber interest in articles
about new developments in computers. The results
of the subscriber survey are also made available to
potential advertisers. The high percentage of subscribers
involved with computer purchases at work would be
an incentive for a computer manufacturer to consider
advertising in Bloomberg Businessweek.

In this chapter, we discuss the types of data available
for statistical analysis and describe how the data are
obtained. We introduce descriptive statistics and statistical
inference as ways of converting data into meaningful
and easily interpreted statistical information.

[Photo caption: Bloomberg Businessweek uses statistical facts and summaries
in many of its articles. AP Images/Weng lei-Imaginechina]


*The authors are indebted to Charlene Trentham, Research Manager, 
for providing this Statistics in Practice. 


Frequently, we see the following types of statements in newspapers and magazines: 


• Unemployment dropped to an 18-year low of 3.8% in May 2018 from 3.9% in
April and after holding at 4.1% for the prior six months (Wall Street Journal,
June 1, 2018).

• Tesla ended 2017 with around $5.4 billion of liquidity. Analysts forecast it
will burn through $2.8 billion of cash this year (Bloomberg Businessweek,
April 19, 2018).

• The biggest banks in America reported a good set of earnings for the first three
months of 2018. Bank of America and Morgan Stanley made quarterly net profits of
$6.9 billion and $2.7 billion, respectively (The Economist, April 21, 2018).

• According to a study from the Pew Research Center, 15% of U.S. adults say they
have used online dating sites or mobile apps (Wall Street Journal, May 2, 2018).

• According to the U.S. Centers for Disease Control and Prevention, in the United
States alone, at least 2 million illnesses and 23,000 deaths can be attributed each year
to antibiotic-resistant bacteria (Wall Street Journal, February 13, 2018).


The numerical facts in the preceding statements—3.8%, 3.9%, 4.1%, $5.4 billion, 
$2.8 billion, $6.9 billion, $2.7 billion, 15%, 2 million, 23,000—are called statistics. In this 
usage, the term statistics refers to numerical facts such as averages, medians, percentages, 
and maximums that help us understand a variety of business and economic situations. 
However, as you will see, the subject of statistics involves much more than numerical facts. 
In a broader sense, statistics is the art and science of collecting, analyzing, presenting, 
and interpreting data. Particularly in business and economics, the information provided by 
collecting, analyzing, presenting, and interpreting data gives managers and decision makers 
a better understanding of the business and economic environment and thus enables them to 
make more informed and better decisions. In this text, we emphasize the use of statistics 
for business and economic decision making. 

Chapter 1 begins with some illustrations of the applications of statistics in business
and economics. In Section 1.2 we define the term data and introduce the concept of a data 
set. This section also introduces key terms such as variables and observations, discusses 
the difference between quantitative and categorical data, and illustrates the uses of cross- 
sectional and time series data. Section 1.3 discusses how data can be obtained from 
existing sources or through survey and experimental studies designed to obtain new data. 
The uses of data in developing descriptive statistics and in making statistical inferences 
are described in Sections 1.4 and 1.5. The last four sections of Chapter 1 cover the role
of the computer in statistical analysis, an introduction to business analytics and the role 
statistics plays in it, an introduction to big data and data mining, and a discussion of ethical 
guidelines for statistical practice. 


1.1 Applications in Business and Economics 


In today’s global business and economic environment, anyone can access vast amounts of 
statistical information. The most successful managers and decision makers understand the 
information and know how to use it effectively. In this section, we provide examples that 
illustrate some of the uses of statistics in business and economics. 


Accounting 


Public accounting firms use statistical sampling procedures when conducting audits for 
their clients. For instance, suppose an accounting firm wants to determine whether the 
amount of accounts receivable shown on a client’s balance sheet fairly represents the actual 
amount of accounts receivable. Usually the large number of individual accounts receivable 
makes reviewing and validating every account too time-consuming and expensive. As 
common practice in such situations, the audit staff selects a subset of the accounts called a 
sample. After reviewing the accuracy of the sampled accounts, the auditors draw a conclu- 
sion as to whether the accounts receivable amount shown on the client’s balance sheet is 
acceptable. 
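
The sampling step described above can be sketched in a few lines of Python. This is only an illustration, not an audit procedure: the account numbers, the population size of 1,000, the sample size of 50, and the helper name `select_audit_sample` are all hypothetical, and simple random sampling without replacement is assumed as the selection method.

```python
import random

def select_audit_sample(account_ids, sample_size, seed=None):
    """Return a simple random sample (without replacement) of account IDs."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(account_ids, sample_size)

# Hypothetical population: 1,000 accounts receivable, identified by number.
accounts = list(range(1, 1001))

# Hypothetical sample size: 50 accounts for the audit staff to review.
sample = select_audit_sample(accounts, 50, seed=42)
print(len(sample), "accounts selected for review")
```

Each account has the same chance of being selected, which is what lets the auditors generalize from the sampled accounts to the full set of accounts receivable.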


Finance 


Financial analysts use a variety of statistical information to guide their investment 
recommendations. In the case of stocks, analysts review financial data such as price/ 
earnings ratios and dividend yields. By comparing the information for an individual 
stock with information about the stock market averages, an analyst can begin to draw 
a conclusion as to whether the stock is a good investment. For example, the average 
dividend yield for the S&P 500 companies for 2017 was 1.88%. Over the same period 
the average dividend yield for Microsoft was 1.72% (Yahoo Finance). In this case, the 
statistical information on dividend yield indicates a lower dividend yield for Microsoft 
than the average dividend yield for the S&P 500 companies. This and other information
about Microsoft would help the analyst make an informed buy, sell, or hold
recommendation for Microsoft stock. 


Marketing 


Electronic scanners at retail checkout counters collect data for a variety of marketing 
research applications. For example, data suppliers such as The Nielsen Company and
IRI purchase point-of-sale scanner data from grocery stores, process the data, and then
sell statistical summaries of the data to manufacturers. Manufacturers spend hundreds of 
thousands of dollars per product category to obtain this type of scanner data. Manufactur- 
ers also purchase data and statistical summaries on promotional activities such as special 
pricing and the use of in-store displays. Brand managers can review the scanner statistics 
and the promotional activity statistics to gain a better understanding of the relationship 
between promotional activities and sales. Such analyses often prove helpful in establishing 
future marketing strategies for the various products. 


Production 


Today’s emphasis on quality makes quality control an important application of statistics in 
production. A variety of statistical quality control charts are used to monitor the output
of a production process. In particular, an x-bar chart can be used to monitor the average
output. Suppose, for example, that a machine fills containers with 12 ounces of a soft 
drink. Periodically, a production worker selects a sample of containers and computes the 
average number of ounces in the sample. This average, or x-bar value, is plotted on an 
x-bar chart. A plotted value above the chart’s upper control limit indicates overfilling, and 
a plotted value below the chart’s lower control limit indicates underfilling. The process is 
termed “in control” and allowed to continue as long as the plotted x-bar values fall between 
the chart’s upper and lower control limits. Properly interpreted, an x-bar chart can help 
determine when adjustments are necessary to correct a production process. 
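
The x-bar check described above can be sketched as follows. The control limits of 11.9 and 12.1 ounces, the five sample values, and the function name `check_xbar` are assumptions made for illustration; the text does not say how the limits are determined.

```python
from statistics import mean

def check_xbar(sample_ounces, lower_limit, upper_limit):
    """Compute the sample mean (x-bar) and compare it with the control limits."""
    x_bar = mean(sample_ounces)
    if x_bar > upper_limit:
        status = "overfilling"
    elif x_bar < lower_limit:
        status = "underfilling"
    else:
        # A value exactly on a limit is treated as in control here (an assumption).
        status = "in control"
    return x_bar, status

# Hypothetical control limits around the 12-ounce target.
LCL, UCL = 11.9, 12.1

# Hypothetical sample of five containers.
x_bar, status = check_xbar([12.02, 11.98, 12.05, 11.97, 12.03], LCL, UCL)
print(round(x_bar, 3), status)
```

In practice each new x-bar value would be added to the chart as a plotted point, so that a run of values drifting toward a limit is visible before the limit is crossed.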


Economics 


Economists frequently provide forecasts about the future of the economy or some aspect 
of it. They use a variety of statistical information in making such forecasts. For instance, 
in forecasting inflation rates, economists use statistical information on such indicators as 
the Producer Price Index, the unemployment rate, and manufacturing capacity utilization. 
Often these statistical indicators are entered into computerized forecasting models that 
predict inflation rates. 


Information Systems 


Information systems administrators are responsible for the day-to-day operation of an 
organization’s computer networks. A variety of statistical information helps administrat- 
ors assess the performance of computer networks, including local area networks (LANs), 
wide area networks (WANs), network segments, intranets, and other data communication 
systems. Statistics such as the mean number of users on the system, the proportion of time 
any component of the system is down, and the proportion of bandwidth utilized at various 
times of the day, are examples of statistical information that help the system administrator 
better understand and manage the computer network. 


Applications of statistics such as those described in this section are an integral part of
this text. Such examples provide an overview of the breadth of statistical applications. To
supplement these examples, practitioners in the fields of business and economics provided 
chapter-opening Statistics in Practice articles that introduce the material covered in each 
chapter. The Statistics in Practice applications show the importance of statistics in a wide 
variety of business and economic situations. 
1.2 Data


Data are the facts and figures collected, analyzed, and summarized for presentation and 
interpretation. All the data collected in a particular study are referred to as the data set 
for the study. Table 1.1 shows a data set containing information for 60 nations that par- 
ticipate in the World Trade Organization (WTO). The WTO encourages the free flow of 
international trade and provides a forum for resolving trade disputes. 


Elements, Variables, and Observations 


Elements are the entities on which data are collected. Each nation listed in Table 1.1 is an 
element with the nation or element name shown in the first column. With 60 nations, the 
data set contains 60 elements. 

A variable is a characteristic of interest for the elements. The data set in Table 1.1 
includes the following four variables:


• WTO Status: The nation’s membership status in the World Trade Organization; this
can be either as a member or as an observer.

• Per Capita Gross Domestic Product (GDP) ($): The total market value ($) of all
goods and services produced by the nation divided by the number of people in the
nation; this is commonly used to compare economic productivity of the nations.

• Fitch Rating: The nation’s sovereign credit rating as appraised by the Fitch Group¹; the
credit ratings range from a high of AAA to a low of F and can be modified by + or -.

• Fitch Outlook: An indication of the direction the credit rating is likely to move over
the upcoming two years; the outlook can be negative, stable, or positive.


Measurements collected on each variable for every element in a study provide the data. The 
set of measurements obtained for a particular element is called an observation. Referring 
to Table 1.1, we see that the first observation contains the following measurements: 
Member, 3615, BB-, and Stable. The second observation contains the following 
measurements: Member, 49755, AAA, and Stable; and so on. A data set with 60 elements
contains 60 observations. 
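
One way to picture elements, variables, and observations is as a small data structure. The sketch below uses the first two rows of Table 1.1; the class name `Observation` and its field names are hypothetical labels chosen for this illustration.

```python
from typing import NamedTuple

class Observation(NamedTuple):
    """The set of measurements collected for a single element (nation)."""
    wto_status: str
    per_capita_gdp: int  # dollars
    fitch_rating: str
    fitch_outlook: str

# Each key is an element; each value is that element's observation.
data_set = {
    "Armenia": Observation("Member", 3615, "BB-", "Stable"),
    "Australia": Observation("Member", 49755, "AAA", "Stable"),
}

variables = Observation._fields  # one field per variable
print(len(data_set), "elements,", len(data_set), "observations,",
      len(variables), "variables")
```

Because every element contributes exactly one observation, the number of observations always equals the number of elements.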


Scales of Measurement 


Data collection requires one of the following scales of measurement: nominal, ordinal, 
interval, or ratio. The scale of measurement determines the amount of information con- 
tained in the data and indicates the most appropriate data summarization and statistical 
analyses. 

When the data for a variable consist of labels or names used to identify an attribute 
of the element, the scale of measurement is considered a nominal scale. For example, 
referring to the data in Table 1.1, the scale of measurement for the WTO Status variable is 
nominal because the data “member” and “observer” are labels used to identify the status 
category for the nation. In cases where the scale of measurement is nominal, a numerical 
code as well as a nonnumerical label may be used. For example, to facilitate data collec- 
tion and to prepare the data for entry into a computer database, we might use a numerical 
code for the WTO Status variable by letting 1 denote a member nation in the World Trade 
Organization and 2 denote an observer nation. The scale of measurement is nominal even 
though the data appear as numerical values. 
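
The numerical coding described above can be sketched as follows. The code values 1 and 2 come from the text; the six coded entries are hypothetical and serve only to show that counting is meaningful for nominal data while arithmetic on the codes is not.

```python
from collections import Counter

# Numerical code from the text: 1 = member nation, 2 = observer nation.
WTO_STATUS_CODES = {1: "Member", 2: "Observer"}

# Hypothetical coded values for six nations (illustrative only).
coded_status = [1, 1, 2, 1, 2, 1]

# Decoding and counting are meaningful operations for nominal data.
labels = [WTO_STATUS_CODES[code] for code in coded_status]
counts = Counter(labels)
print(counts)

# The mean of the codes can be computed, but it has no interpretation:
# the codes are labels, not quantities.
meaningless_average = sum(coded_status) / len(coded_status)
```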

The scale of measurement for a variable is considered an ordinal scale if the data 
exhibit the properties of nominal data and in addition, the order or rank of the data is mean- 
ingful. For example, referring to the data in Table 1.1, the scale of measurement for the 
Fitch Rating is ordinal, because the rating labels which range from AAA to F can be rank 
ordered from best credit rating AAA to poorest credit rating F. The rating letters provide 


¹The Fitch Group is one of three nationally recognized statistical rating organizations designated by the U.S.
Securities and Exchange Commission. The other two are Standard and Poor's and Moody's Investors Service.
TABLE 1.1 Data Set for 60 Nations in the World Trade Organization

Nation            WTO Status   Per Capita GDP ($)   Fitch Rating   Fitch Outlook
Armenia           Member       3,615                BB-            Stable
Australia         Member       49,755               AAA            Stable
Austria           Member       44,758               AAA            Stable
Azerbaijan        Observer     3,879                BBB-           Stable
Bahrain           Member       22,579               BBB            Stable
Belgium           Member       [illegible]          AA             Stable
Brazil            Member       8,650                BBB            Stable
Bulgaria          Member       7,469                BBB-           Stable
Canada            Member       42,349               AAA            Stable
Cape Verde        Member       2,998                [illegible]    Stable
Chile             Member       13,793               [illegible]    Stable
China             Member       8,123                A+             Stable
Colombia          Member       5,806                BBB-           Stable
Costa Rica        Member       11,825               BB+            Stable
Croatia           Member       12,149               BBB-           Negative
Cyprus            Member       23,541               B              Negative
Czech Republic    Member       18,484               A+             Stable
Denmark           Member       53,579               AAA            Stable
Ecuador           Member       6,019                [illegible]    Positive
Egypt             Member       3,478                B              Negative
El Salvador       Member       4,224                BB             Negative
Estonia           Member       [illegible]          A+             Stable
France            Member       36,857               AAA            Negative
Georgia           Member       3,866                [illegible]    Stable
Germany           Member       42,161               AAA            Stable
Hungary           Member       12,820               BB+            Stable
Iceland           Member       60,530               BBB            Stable
Ireland           Member       64,175               BBB+           Stable
Israel            Member       37,181               A              Stable
Italy             Member       30,669               A-             Negative
Japan             Member       38,972               A+             Negative
Kazakhstan        Observer     [illegible]          BBB+           Stable
Kenya             Member       1,455                [illegible]    Stable
Latvia            Member       14,071               BBB            Positive
Lebanon           Observer     8,257                B              Stable
Lithuania         Member       14,913               BBB            Stable
Malaysia          Member       9,508                A-             Stable
Mexico            Member       8,209                BBB            Stable
Peru              Member       6,049                BBB            Stable
Philippines       Member       2251                 BB+            Stable
Poland            Member       12,414               A-             Positive
Portugal          Member       19,872               BB+            Negative
South Korea       Member       272539               AA-            Stable
Romania           Member       91523                BBB-           Stable
Russia            Member       8,748                BBB            Stable
Rwanda            Member       703                  B              Stable
Serbia            Observer     5,426                BB-            Negative
Singapore         Member       52,962               AAA            Stable
Slovakia          Member       16,530               A+             Stable
Slovenia          Member       21,650               A-             Negative
South Africa      Member       5275                 BBB            Stable
Spain             Member       26,617               A-             Stable
Sweden            Member       51,845               AAA            Stable
Switzerland       Member       79,888               AAA            Stable
Thailand          Member       5,911                BBB            Stable
Turkey            Member       10,863               BBB-           Stable
United Kingdom    Member       40,412               AAA            Negative
United States     Member       57,638               AAA            Stable
Uruguay           Member       157221               BB+            Positive
Zambia            Member       1,270                B+             Negative

[Margin note: DATAfile Nations. Data sets such as Nations are available on the
companion site for this title.]


the labels similar to nominal data, but in addition, the data can also be ranked or ordered 
based on the credit rating, which makes the measurement scale ordinal. Ordinal data can 
also be recorded by a numerical code, for example, your class rank in school. 

The scale of measurement for a variable is an interval scale if the data have all the 
properties of ordinal data and the interval between values is expressed in terms of a fixed 
unit of measure. Interval data are always numeric. College admission SAT scores are an 
example of interval-scaled data. For example, three students with SAT math scores of 
620, 550, and 470 can be ranked or ordered in terms of best performance to poorest per- 
formance in math. In addition, the differences between the scores are meaningful. For 
instance, student 1 scored 620 - 550 = 70 points more than student 2, while student 2
scored 550 - 470 = 80 points more than student 3.

The scale of measurement for a variable is a ratio scale if the data have all the prop- 
erties of interval data and the ratio of two values is meaningful. Variables such as distance, 
height, weight, and time use the ratio scale of measurement. This scale requires that a 
zero value be included to indicate that nothing exists for the variable at the zero point. For 
example, consider the cost of an automobile. A zero value for the cost would indicate that 
the automobile has no cost and is free. In addition, if we compare the cost of $30,000 for 
one automobile to the cost of $15,000 for a second automobile, the ratio property shows 
that the first automobile is $30,000/$15,000 = 2 times, or twice, the cost of the second 
automobile. 
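
The interval and ratio examples above amount to simple arithmetic, sketched here in Python. The numbers are the ones used in the text; the variable names are chosen only for this illustration.

```python
# Interval scale: differences between values are meaningful.
sat_math = {"student 1": 620, "student 2": 550, "student 3": 470}
diff_1_2 = sat_math["student 1"] - sat_math["student 2"]
diff_2_3 = sat_math["student 2"] - sat_math["student 3"]

# The ordinal property still holds: scores can be ranked best to poorest.
ranking = sorted(sat_math, key=sat_math.get, reverse=True)

# Ratio scale: zero means "nothing exists", so ratios are meaningful.
cost_first, cost_second = 30_000, 15_000
cost_ratio = cost_first / cost_second  # the first car costs twice as much

print(diff_1_2, diff_2_3, ranking, cost_ratio)
```

Each scale keeps the properties of the one before it: the SAT scores can be ranked (ordinal) and subtracted (interval), while the automobile costs can additionally be divided (ratio).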


Categorical and Quantitative Data 


Data can be classified as either categorical or quantitative. Data that can be grouped by 
specific categories are referred to as categorical data. Categorical data use either the nom- 
inal or ordinal scale of measurement. Data that use numeric values to indicate how much 
or how many are referred to as quantitative data. Quantitative data are obtained using 
either the interval or ratio scale of measurement. 

[Margin note: The statistical method appropriate for summarizing data depends
upon whether the data are categorical or quantitative.]

A categorical variable is a variable with categorical data, and a quantitative variable
is a variable with quantitative data. The statistical analysis appropriate for a particular
variable depends upon whether the variable is categorical or quantitative. If the variable
is categorical, the statistical analysis is limited. We can summarize categorical data by
counting the number of observations in each category or by computing the proportion of
the observations in each category. However, even when the categorical data are identified 
by a numerical code, arithmetic operations such as addition, subtraction, multiplication, 
and division do not provide meaningful results. Section 2.1 discusses ways of summarizing 
categorical data. 
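
As a brief illustration of the two summaries just described, the sketch below counts the WTO Status values and converts the counts to proportions. The tallies of 56 members and 4 observers are read off Table 1.1; the variable names are hypothetical.

```python
from collections import Counter

# WTO Status values as tallied from Table 1.1: 56 members and 4 observers.
wto_status = ["Member"] * 56 + ["Observer"] * 4

frequency = Counter(wto_status)  # number of observations in each category
n = len(wto_status)
proportion = {category: count / n for category, count in frequency.items()}

for category in frequency:
    print(f"{category}: {frequency[category]} ({proportion[category]:.3f})")
```
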
Arithmetic operations provide meaningful results for quantitative variables. For ex- 
ample, quantitative data may be added and then divided by the number of observations to 
compute the average value. This average is usually meaningful and easily interpreted. In 
general, more alternatives for statistical analysis are possible when data are quantitative.
Section 2.2 and Chapter 3 provide ways of summarizing quantitative data. 
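
The averaging operation described above can be sketched with a few quantitative values. The three per capita GDP figures are the first three rows of Table 1.1; using only three nations is a simplification for illustration.

```python
from statistics import mean

# Per capita GDP ($) for the first three nations in Table 1.1.
per_capita_gdp = {"Armenia": 3615, "Australia": 49755, "Austria": 44758}

values = list(per_capita_gdp.values())
total = sum(values)      # add the values
average = mean(values)   # then divide by the number of observations
print(f"Total: {total}, average: {average:.2f}")
```

The same calculation applied to a numerically coded categorical variable would run without error but produce a number with no useful interpretation.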


Cross-Sectional and Time Series Data 


For purposes of statistical analysis, distinguishing between cross-sectional data and time 
series data is important. Cross-sectional data are data collected at the same or approxim- 
ately the same point in time. The data in Table 1.1 are cross-sectional because they describe 
the four variables for the 60 World Trade Organization nations at the same point in time.
Time series data are data collected over several time periods. For example, the time series 
in Figure 1.1 shows the U.S. average price per gallon of conventional regular gasoline 
between 2012 and 2018. From January 2012 until June 2014, prices fluctuated between 
$3.19 and $3.84 per gallon before a long stretch of decreasing prices from July 2014 to 
January 2015. The lowest average price per gallon occurred in January 2016 ($1.68). Since 
then, the average price appears to be on a gradual increasing trend. 

Graphs of time series data are frequently found in business and economic publications. 
Such graphs help analysts understand what happened in the past, identify any trends over 
time, and project future values for the time series. The graphs of time series data can take 
on a variety of forms, as shown in Figure 1.2. With a little study, these graphs are usually 
easy to understand and interpret. For example, Panel (A) in Figure 1.2 is a graph that shows 
the Dow Jones Industrial Average Index from 2008 to 2018. Poor economic conditions 
caused a serious drop in the index during 2008 with the low point occurring in February 
2009 (7,062). Since then, the index has been on a remarkable nine-year increase, reaching
its peak (26,149) in January 2018.

The graph in Panel (B) shows the net income of McDonald’s Inc. from 2008 to 2017. The 
declining economic conditions in 2008 and 2009 were actually beneficial to McDonald’s as 
the company’s net income rose to all-time highs. The growth in McDonald’s net income 
showed that the company was thriving during the economic downturn as people were 
cutting back on the more expensive sit-down restaurants and seeking less-expensive 
alternatives offered by McDonald’s. McDonald’s net income continued to reach new all-time
highs in 2010 and 2011, decreased slightly in 2012, and peaked in 2013. After three years of
relatively lower net income, the company’s net income increased to $5.19 billion in 2017.

Panel (C) shows the time series for the occupancy rate of hotels in South Florida over
a one-year period. The highest occupancy rates, 95% and 98%, occur during the months
of February and March when the climate of South Florida is attractive to tourists. In fact,
January to April of each year is typically the high-occupancy season for South Florida hotels.
On the other hand, note the low occupancy rates during the months of August to October,
with the lowest occupancy rate of 50% occurring in September. High temperatures and the
hurricane season are the primary reasons for the drop in hotel occupancy during this period.


FIGURE 1.1 U.S. Average Price per Gallon for Conventional Regular Gasoline
Source: Energy Information Administration, U.S. Department of Energy.

FIGURE 1.2 A Variety of Graphs of Time Series Data


NOTES + COMMENTS


1. An observation is the set of measurements obtained for each element in a data set.
   Hence, the number of observations is always the same as the number of elements.
   The number of measurements obtained for each element equals the number of variables.
   Hence, the total number of data items can be determined by multiplying the number
   of observations by the number of variables.

2. Quantitative data may be discrete or continuous. Quantitative data that measure
   how many (e.g., number of calls received in 5 minutes) are discrete. Quantitative
   data that measure how much (e.g., weight or time) are continuous because no
   separation occurs between the possible data values.


1.3 Data Sources 


Data can be obtained from existing sources, by conducting an observational study, or by 
conducting an experiment. 


Existing Sources 


In some cases, data needed for a particular application already exist. Companies maintain 
a variety of databases about their employees, customers, and business operations. Data 
on employee salaries, ages, and years of experience can usually be obtained from internal 
personnel records. Other internal records contain data on sales, advertising expendit- 
ures, distribution costs, inventory levels, and production quantities. Most companies also 
maintain detailed data about their customers. Table 1.2 shows some of the data commonly 
available from internal company records. 

Organizations that specialize in collecting and maintaining data make available sub- 
stantial amounts of business and economic data. Companies access these external data 
sources through leasing arrangements or by purchase. Dun & Bradstreet, Bloomberg, and 
Dow Jones & Company are three firms that provide extensive business database services 
to clients. The Nielsen Company and IRI built successful businesses collecting and pro- 
cessing data that they sell to advertisers and product manufacturers. 


TABLE 1.2 Examples of Data Available from Internal Company Records


Source              Some of the Data Typically Available

Employee records    Name, address, social security number, salary, number of vacation
                    days, number of sick days, and bonus

Production records  Part or product number, quantity produced, direct labor cost, and
                    materials cost

Inventory records   Part or product number, number of units on hand, reorder level,
                    economic order quantity, and discount schedule

Sales records       Product number, sales volume, sales volume by region, and sales
                    volume by customer type

Credit records      Customer name, address, phone number, credit limit, and accounts
                    receivable balance

Customer profile    Age, gender, income level, household size, address, and preferences


Data are also available from a variety of industry associations and special interest
organizations. The U.S. Travel Association maintains travel-related information such as 
the number of tourists and travel expenditures by states. Such data would be of interest to 
firms and individuals in the travel industry. The Graduate Management Admission Council 
maintains data on test scores, student characteristics, and graduate management education 
programs. Most of the data from these types of sources are available to qualified users at a 
modest cost. 

The Internet is an important source of data and statistical information. Almost all 
companies maintain websites that provide general information about the company as well 
as data on sales, number of employees, number of products, product prices, and product 
specifications. In addition, a number of companies now specialize in making information 
available over the Internet. As a result, one can obtain access to stock quotes, meal prices at 
restaurants, salary data, and an almost infinite variety of information. 

Government agencies are another important source of existing data. For instance, the 
website DATA.GOV was launched by the U.S. government in 2009 to make it easier for the 
public to access data collected by the U.S. federal government. The DATA.GOV website 
includes over 150,000 data sets from a variety of U.S. federal departments and agencies, 
but there are many other federal agencies that maintain their own websites and data repos- 
itories. Table 1.3 lists selected governmental agencies and some of the data they provide. 
Figure 1.3 shows the homepage for the DATA.GOV website. Many state and local governments
are also now providing data sets online. As examples, the states of California and
Texas maintain open data portals at data.ca.gov and data.texas.gov, respectively. New York
City’s open data website is opendata.cityofnewyork.us, and that of the city of Cincinnati,
Ohio, is data.cincinnati-oh.gov.


Observational Study 


In an observational study we simply observe what is happening in a particular situation, 
record data on one or more variables of interest, and conduct a statistical analysis of the 
resulting data. For example, researchers might observe a randomly selected group of 
customers that enter a Walmart supercenter to collect data on variables such as the length 
of time the customer spends shopping, the gender of the customer, and the amount spent. 
Statistical analysis of the data may help management determine how factors such as the 
length of time shopping and the gender of the customer affect the amount spent. 

As another example of an observational study, suppose that researchers were interested
in investigating the relationship between the gender of the CEO for a Fortune 500 company
and the performance of the company as measured by the return on equity (ROE). To obtain
data, the researchers selected a sample of companies and recorded the gender of the CEO
and the ROE for each company. Statistical analysis of the data can help determine the
relationship between performance of the company and the gender of the CEO. This example is
an observational study because the researchers had no control over the gender of the CEO
or the ROE at each of the companies that were sampled.


TABLE 1.3 Examples of Data Available from Selected Government Agencies


Government Agency                Some of the Data Available

Census Bureau                    Population data, number of households, and household income

Federal Reserve Board            Data on the money supply, installment credit, exchange rates,
                                 and discount rates

Office of Management and Budget  Data on revenue, expenditures, and debt of the federal
                                 government

Department of Commerce           Data on business activity, value of shipments by industry,
                                 level of profits by industry, and growing and declining
                                 industries

Bureau of Labor Statistics       Consumer spending, hourly earnings, unemployment rate,
                                 safety records, and international statistics

DATA.GOV                         More than 150,000 data sets including agriculture, consumer,
                                 education, health, and manufacturing data


FIGURE 1.3 DATA.GOV Homepage

Surveys and public opinion polls are two other examples of commonly used observational
studies. The data provided by these types of studies simply enable us to observe
opinions of the respondents. For example, the New York State legislature commissioned
a telephone survey in which residents were asked if they would support or oppose an
increase in the state gasoline tax in order to provide funding for bridge and highway repairs.
Statistical analysis of the survey results will assist the state legislature in determining if it
should introduce a bill to increase gasoline taxes.


Experiment


The key difference between an observational study and an experiment is that an experiment
is conducted under controlled conditions. As a result, the data obtained from a well-designed
experiment can often provide more information than the data obtained from existing
sources or by conducting an observational study. For example, suppose a pharmaceutical
company would like to learn about how a new drug it has developed affects blood pressure.
To obtain data about how the new drug affects blood pressure, researchers select a sample
of individuals. Different groups of individuals are given different dosage levels of the new
drug, and before and after data on blood pressure are collected for each group. Statistical
analysis of the data can help determine how the new drug affects blood pressure.

The types of experiments we deal with in statistics often begin with the identification of a
particular variable of interest. Then one or more other variables are identified and controlled so
that data can be obtained about how the other variables influence the primary variable of interest.

The largest experimental statistical study ever conducted is believed to be the 1954 Public
Health Service experiment for the Salk polio vaccine. Nearly 2 million children in grades
1, 2, and 3 were selected from throughout the United States.

In Chapter 13 we discuss statistical methods appropriate for analyzing the data from an
experiment.


Time and Cost Issues


Anyone wanting to use data and statistical analysis as aids to decision making must be 
aware of the time and cost required to obtain the data. The use of existing data sources is 
desirable when data must be obtained in a relatively short period of time. If important data 
are not readily available from an existing source, the additional time and cost involved
in obtaining the data must be taken into account. In all cases, the decision maker should
consider the contribution of the statistical analysis to the decision-making process. The cost 
of data acquisition and the subsequent statistical analysis should not exceed the savings 
generated by using the information to make a better decision. 


Data Acquisition Errors 


Managers should always be aware of the possibility of data errors in statistical studies. 
Using erroneous data can be worse than not using any data at all. An error in data acquis- 
ition occurs whenever the data value obtained is not equal to the true or actual value that 
would be obtained with a correct procedure. Such errors can occur in a number of ways. 
For example, an interviewer might make a recording error, such as a transposition in writ- 
ing the age of a 24-year-old person as 42, or the person answering an interview question 
might misinterpret the question and provide an incorrect response. 
Experienced data analysts take great care in collecting and recording data to ensure that
errors are not made. Special procedures can be used to check for internal consistency of the
data. For instance, such procedures would indicate that the analyst should review the
accuracy of data for a respondent shown to be 22 years of age but reporting 20 years of work
experience. Data analysts also review data with unusually large and small values, called
outliers, which are candidates for possible data errors.

In Chapter 3 we present some of the methods statisticians use to identify outliers.
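
An internal consistency check of the kind described above might be sketched as follows. This is only a minimal illustration in Python: the record layout, the field names, and the assumed minimum working age of 16 are all hypothetical choices, not procedures given in the text.

```python
# Hypothetical survey records; the field names are assumptions for illustration.
records = [
    {"id": 1, "age": 22, "years_experience": 20},
    {"id": 2, "age": 45, "years_experience": 20},
    {"id": 3, "age": 30, "years_experience": 8},
]

MIN_WORKING_AGE = 16  # assumed threshold, not taken from the text


def flag_inconsistent(rows, min_working_age=MIN_WORKING_AGE):
    """Return the ids of rows whose reported work experience exceeds
    the number of years the respondent could plausibly have worked."""
    return [
        row["id"]
        for row in rows
        if row["years_experience"] > row["age"] - min_working_age
    ]


flagged_ids = flag_inconsistent(records)
print(flagged_ids)
```

A flagged record is a candidate for review, not proof of an error; the analyst would still go back to the source to confirm the values.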


Errors often occur during data acquisition. Blindly using any data that happen to be 
available or using data that were acquired with little care can result in misleading informa- 
tion and bad decisions. Thus, taking steps to acquire accurate data can help ensure reliable 
and valuable decision-making information. 


1.4 Descriptive Statistics 


Most of the statistical information in the media, company reports, and other publications 
consists of data that are summarized and presented in a form that is easy for the reader to 
understand. Such summaries of data, which may be tabular, graphical, or numerical, are 
referred to as descriptive statistics. 

Refer to the data set in Table 1.1 showing data for 60 nations that participate in the
World Trade Organization. Methods of descriptive statistics can be used to summarize
these data. For example, consider the variable Fitch Outlook, which indicates the
direction the nation’s credit rating is likely to move over the next two years. The Fitch Outlook
is recorded as being negative, stable, or positive. A tabular summary of the data showing
the number of nations with each of the Fitch Outlook ratings is shown in Table 1.4. A
graphical summary of the same data, called a bar chart, is shown in Figure 1.4. These
types of summaries make the data easier to interpret. Referring to Table 1.4 and
Figure 1.4, we can see that the majority of Fitch Outlook credit ratings are stable, with
73.3% of the nations having this rating. More nations have a negative outlook (20%) than
a positive outlook (6.7%).


TABLE 1.4 Frequencies and Percent Frequencies for the Fitch Credit Rating Outlook of 60 Nations


Fitch Outlook   Frequency   Percent Frequency (%)
Positive             4             6.7
Stable              44            73.3
Negative            12            20.0


FIGURE 1.4 Bar Chart for the Fitch Credit Rating Outlook for 60 Nations
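
The percent frequencies in a tabular summary like Table 1.4 follow directly from the frequencies: each count is divided by the total number of observations and multiplied by 100. A minimal Python sketch, using the three frequencies reported for the 60 nations (the function name is our own), is:

```python
# Frequencies for the Fitch Outlook variable as reported for the 60 nations.
frequencies = {"Positive": 4, "Stable": 44, "Negative": 12}


def percent_frequencies(freqs):
    """Convert counts to percent frequencies, rounded to one decimal place."""
    total = sum(freqs.values())
    return {label: round(100 * count / total, 1) for label, count in freqs.items()}


percents = percent_frequencies(frequencies)
for label, pct in percents.items():
    print(f"{label}: {frequencies[label]} nations, {pct}%")
```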

A graphical summary of the data for the quantitative variable Per Capita GDP in
Table 1.1, called a histogram, is provided in Figure 1.5. Using the histogram, it is easy
to see that Per Capita GDP for the 60 nations ranges from $0 to $80,000, with the
highest concentration between $0 and $10,000. Only one nation had a Per Capita GDP
exceeding $70,000.


FIGURE 1.5 Histogram of Per Capita GDP for 60 Nations


Chapters 2 and 3 devote attention to the tabular, graphical, and numerical methods of
descriptive statistics.

The U.S. government conducts a census every 10 years. Market research firms conduct
sample surveys every day.

In addition to tabular and graphical displays, numerical descriptive statistics are used 
to summarize data. The most common numerical measure is the average, or mean. Using 
the data on Per Capita GDP for the 60 nations in Table 1.1, we can compute the average by 
adding Per Capita GDP for all 60 nations and dividing the total by 60. Doing so provides 
an average Per Capita GDP of $21,279. This average provides a measure of the central 
tendency, or central location of the data. 

There is a great deal of interest in effective methods for developing and presenting 
descriptive statistics. 


1.5 Statistical Inference 


Many situations require information about a large group of elements (individuals, compan- 
ies, voters, households, products, customers, and so on). But, because of time, cost, and 
other considerations, data can be collected from only a small portion of the group. The lar- 
ger group of elements in a particular study is called the population, and the smaller group 
is called the sample. Formally, we use the following definitions. 


POPULATION 
A population is the set of all elements of interest in a particular study. 


SAMPLE 


A sample is a subset of the population. 


The process of conducting a survey to collect data for the entire population is called a 
census. The process of conducting a survey to collect data for a sample is called a sample 
survey. As one of its major contributions, statistics uses data from a sample to make estim- 
ates and test hypotheses about the characteristics of a population through a process referred 
to as statistical inference. 

As an example of statistical inference, let us consider the study conducted by Rogers 
Industries. Rogers manufactures lithium batteries used in rechargeable electronics such as 
laptop computers and tablets. In an attempt to increase battery life for its products, Rogers 
has developed a new solid-state lithium battery that should last longer and be safer to use. 
In this case, the population is defined as all lithium batteries that could be produced using 
the new solid-state technology. To evaluate the advantages of the new battery, a sample of
200 batteries manufactured with the new solid-state technology was tested. Data collected
from this sample showed the number of hours each battery lasted before needing to be
recharged under controlled conditions. See Table 1.5.

Suppose Rogers wants to use the sample data to make an inference about the average 
hours of battery life for the population of all batteries that could be produced with the 
new solid-state technology. Adding the 200 values in Table 1.5 and dividing the total by 
200 provides the sample average battery life: 18.84 hours. We can use this sample result 
to estimate that the average lifetime for the batteries in the population is 18.84 hours. 
Figure 1.6 provides a graphical summary of the statistical inference process for Rogers 
Industries. 

Whenever statisticians use a sample to estimate a population characteristic of interest, 
they usually provide a statement of the quality, or precision, associated with the estimate. 
For the Rogers Industries example, the statistician might state that the point estimate of the
average battery life is 18.84 hours ± 0.68 hours. Thus, an interval estimate of the average
battery life is 18.16 to 19.52 hours. The statistician can also state how confident he or she
is that the interval from 18.16 to 19.52 hours contains the population average.


TABLE 1.5 Hours Until Recharge for a Sample of 200 Batteries for the Rogers Industries Example


Battery Life (hours)


FIGURE 1.6 The Process of Statistical Inference for the Rogers Industries Example


1. Population consists of all batteries manufactured with the new solid-state technology.
   Average lifetime is unknown.
2. A sample of 200 batteries is manufactured with the new solid-state technology.
3. The sample data provide a sample average lifetime of 18.84 hours per battery.
4. The sample average is used to estimate the population average.
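
The interval estimate quoted for Rogers Industries is simple arithmetic on the point estimate and the stated margin. The following Python sketch reproduces it; the function name is our own, and how the 0.68-hour margin itself is obtained is not covered here.

```python
def interval_estimate(point_estimate, margin):
    """Return (lower, upper) limits: point estimate minus and plus the margin."""
    lower = round(point_estimate - margin, 2)
    upper = round(point_estimate + margin, 2)
    return lower, upper


# Values quoted in the Rogers Industries example.
sample_average_hours = 18.84
margin_hours = 0.68

lower_limit, upper_limit = interval_estimate(sample_average_hours, margin_hours)
print(f"Interval estimate: {lower_limit} to {upper_limit} hours")
```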


1.6 Statistical Analysis Using Microsoft Excel 


Because statistical analysis typically involves working with large amounts of data, computer
software is frequently used to conduct the analysis. In this book we show how statistical
analysis can be performed using Microsoft Excel.

We want to emphasize that this book is about statistics; it is not a book about spreadsheets.
Our focus is on showing the appropriate statistical procedures for collecting,
analyzing, presenting, and interpreting data. Because Excel is widely available in business 
organizations, you can expect to put the knowledge gained here to use in the setting where 
you currently, or soon will, work. If, in the process of studying this material, you become 
more proficient with Excel, so much the better. 

We begin most sections with an application scenario in which a statistical procedure is 
useful. After showing what the statistical procedure is and how it is used, we turn to show- 
ing how to implement the procedure using Excel. Thus, you should gain an understanding 
of what the procedure is, the situation in which it is useful, and how to implement it using 
the capabilities of Excel. 


Data Sets and Excel Worksheets 


Data sets are organized in Excel worksheets in much the same way as the data set for
the 60 nations that participate in the World Trade Organization that appears in Table 1.1
is organized. Figure 1.7 shows an Excel worksheet for that data set. Note that row 1 and
column A contain labels. Cells B1:E1 contain the variable names; cells A2:A61 contain the
observation names; and cells B2:E61 contain the data that were collected. A purple fill
color is used to highlight the cells that contain the data. Displaying a worksheet with this
many rows on a single page of a textbook is not practical. In such cases we will hide
selected rows to conserve space. In the Excel worksheet shown in Figure 1.7 we have hidden
rows 15 through 54 (observations 14 through 53) to conserve space.

To hide rows 15 through 54 of the Excel worksheet, first select rows 15 through 54. Then,
right-click and choose the Hide option. To redisplay rows 15 through 54, just select rows 14
through 55, right-click, and select the Unhide option.

The data are the focus of the statistical analysis. Except for the headings in row 1, each
row of the worksheet corresponds to an observation and each column corresponds to a
variable. For instance, row 2 of the worksheet contains the data for the first observation,
Armenia; row 3 contains the data for the second observation, Australia; row 4 contains
the data for the third observation, Austria; and so on. The names in column A provide a
convenient way to refer to each of the 60 observations in the study. Note that column B of
the worksheet contains the data for the variable WTO Status, column C contains the data
for the Per Capita GDP ($), and so on.


FIGURE 1.7 Excel Worksheet for the 60 Nations That Participate in the World Trade Organization

FIGURE 1.8 Excel Worksheet for the Rogers Industries Data Set

Suppose now that we want to use Excel to analyze the Rogers Industries data 
shown in Table 1.5. The data in Table 1.5 are organized into 10 columns with 20 data 
values in each column so that the data would fit nicely on a single page of the text. 
Even though the table has several columns, it shows data for only one variable (hours
until recharge). In statistical worksheets it is customary to put all the data for each
variable in a single column. Refer to the Excel worksheet shown in Figure 1.8. Note 
that rows 7 through 195 have been hidden to conserve space. 
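
The rearrangement described above, from a multi-column printed table to a single worksheet column, can be sketched in Python as follows. The tiny two-column layout and its values are hypothetical placeholders (the actual table has 10 columns of 20 values), and reading the printed table column by column is an assumption.

```python
# Hypothetical printed layout: each inner list is one printed column of the table.
# The values are placeholders, not the Rogers Industries data.
table_columns = [
    [18.0, 19.5, 17.5],
    [20.0, 18.5, 19.0],
]


def stack_columns(columns):
    """Stack the printed columns one after another into a single column,
    so that all values of the one variable sit in one list."""
    return [value for column in columns for value in column]


single_column = stack_columns(table_columns)
print(single_column)
print(f"Number of observations: {len(single_column)}")
```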


Using Excel for Statistical Analysis 


To separate the discussion of a statistical procedure from the discussion of using Excel to 
implement the procedure, the material that discusses the use of Excel will usually be set 
apart in sections with headings such as Using Excel to Construct a Bar Chart and a Pie 
Chart, and Using Excel to Construct a Frequency Distribution. In using Excel for statistical 
analysis, four tasks may be needed: Enter/Access Data; Enter Functions and Formulas; 
Apply Tools; and Editing Options. 


Enter/Access Data: Select cell locations for the data and enter the data along with appropriate 
labels; or open an existing Excel file such as one of the files that accompany the text. 


Enter Functions and Formulas: Select cell locations, enter Excel functions and formulas, 
and provide descriptive labels to identify the results. 


Apply Tools: Use Excel’s tools for data analysis and presentation. 


Editing Options: Edit the results to better identify the output or to create a different type
of presentation. For example, when using Excel’s chart tools, we can edit the chart that is
created by adding, removing, or changing chart elements such as the title, legend, and data
labels.

Our approach will be to describe how these tasks are performed each time we use Excel
to implement a statistical procedure. It will always be necessary to enter data or open an 
existing Excel file. But, depending on the complexity of the statistical analysis, only one of 
the second or third tasks may be needed. 

To illustrate how the discussion of Excel will appear throughout the book, we will 
show how to use Excel’s AVERAGE function to compute the average lifetime for the 200 
batteries in Table 1.5. Refer to Figure 1.9 as we describe the tasks involved. The worksheet 
shown in the foreground of Figure 1.9 displays the data for the problem and shows the 
results of the analysis. It is called the value worksheet. The worksheet shown in the back- 
ground displays the Excel formula used to compute the average lifetime and is called the 
formula worksheet. A purple fill color is used to highlight the cells that contain the data in 
both worksheets. In addition, a green fill color is used to highlight the cells containing the 
functions and formulas in the formula worksheet and the corresponding results in the value 
worksheet. 


Enter/Access Data: Open the file Rogers. The data are in cells A2:A201 and the label is in
cell A1.

DATA file: Rogers


Enter Functions and Formulas: Excel’s AVERAGE function can be used to compute the 
mean by entering the following formula in cell E2: 


=AVERAGE(A2:A201) 


Similarly, the formula =MEDIAN(A2:A201) is entered in cell E3 to compute the median.

To identify the result, the label “Average Lifetime” is entered in cell D2 and the 
label “Median Lifetime” is entered into cell D3. Note that for this illustration the 
Apply Tools and Editing Options tasks were not required. The value worksheet shows 
that the value computed using the AVERAGE function is 18.84 hours and that the 
value computed using the MEDIAN function is also 18.84 hours. 
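
For readers who want to check such results outside Excel, the same two summaries can be sketched with Python's standard `statistics` module. The five lifetimes below are hypothetical placeholders, not the 200 values in the Rogers file, so the results differ from those in Figure 1.9.

```python
import statistics

# Hypothetical battery lifetimes in hours (placeholders, not the Rogers data).
lifetimes = [17.5, 18.0, 19.0, 19.5, 21.0]

# Counterpart of =AVERAGE(range): sum of the values divided by their count.
average_lifetime = statistics.mean(lifetimes)

# Counterpart of =MEDIAN(range): middle value of the sorted data.
median_lifetime = statistics.median(lifetimes)

print(f"Average Lifetime: {average_lifetime}")
print(f"Median Lifetime: {median_lifetime}")
```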


FIGURE 1.9 Computing the Average Lifetime of Batteries for Rogers Industries Using Excel's
AVERAGE Function


1.7 Analytics


Because of the dramatic increase in available data, more cost-effective data storage, faster
computer processing, and recognition by managers that data can be extremely valuable for
understanding customers and business operations, there has been a dramatic increase in
data-driven decision making. The broad range of techniques that may be used to support
data-driven decisions comprises what has become known as analytics.

Analytics is the scientific process of transforming data into insights for making better
decisions. Analytics is used for data-driven or fact-based decision making, which is often
seen as more objective than alternative approaches to decision making. The tools of
analytics can aid decision making by creating insights from data, improving our ability to
forecast more accurately for planning, helping us quantify risk, and yielding better alternatives
through analysis.

We adopt the definition of analytics developed by the Institute for Operations Research and
the Management Sciences (INFORMS).

Analytics can involve a variety of techniques from simple reports to the most advanced
optimization techniques (algorithms for finding the best course of action). Analytics is now
generally thought to comprise three broad categories of techniques. These categories are
descriptive analytics, predictive analytics, and prescriptive analytics.

Descriptive analytics encompasses the set of analytical techniques that describe what
has happened in the past. Examples of these types of techniques are data queries, reports,
descriptive statistics, data visualization, data dashboards, and basic what-if spreadsheet
models.

Predictive analytics consists of analytical techniques that use models constructed 
from past data to predict the future or to assess the impact of one variable on another. 
For example, past data on sales of a product may be used to construct a mathematical 
model that predicts future sales. Such a model can account for factors such as the growth 
trajectory and seasonality of the product’s sales based on past growth and seasonal 
patterns. Point-of-sale scanner data from retail outlets may be used by a packaged 
food manufacturer to help estimate the lift in unit sales associated with coupons or 
sales events. Survey data and past purchase behavior may be used to help predict the 
market share of a new product. Each of these is an example of predictive analytics. 
Linear regression, time series analysis, and forecasting models fall into the category of 
predictive analytics; these techniques are discussed later in this text. Simulation, which is 
the use of probability and statistical computer models to better understand risk, also falls 
under the category of predictive analytics. 
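
To make the idea of a model built from past data concrete, the following Python sketch fits a straight-line trend to a few periods of sales by least squares and projects one period ahead. The sales figures are hypothetical, and a simple linear trend is only one possible model; it ignores seasonality and the other factors mentioned above.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past sales (units) for periods 1 through 5.
periods = [1, 2, 3, 4, 5]
sales = [100, 112, 119, 131, 138]

slope, intercept = fit_line(periods, sales)
forecast_period_6 = intercept + slope * 6

print(f"Estimated growth per period: {slope}")
print(f"Forecast for period 6: {forecast_period_6}")
```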

Prescriptive analytics differs greatly from descriptive or predictive analytics. What 
distinguishes prescriptive analytics is that prescriptive models yield a best course of action 
to take. That is, the output of a prescriptive model is a best decision. Hence, prescriptive 
analytics is the set of analytical techniques that yield a course of action. Optimization 
models, which generate solutions that maximize or minimize some objective subject to a 
set of constraints, fall into the category of prescriptive models. The airline industry’s use of 
revenue management is an example of a prescriptive model. The airline industry uses past 
purchasing data as inputs into a model that recommends the pricing strategy across all flights 
that will maximize revenue for the company. 
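
A toy version of such a prescriptive model can be sketched in Python. Everything here is hypothetical: the linear demand function, the list of candidate prices, and the seat capacity are invented for illustration, and a real revenue management system is far more elaborate. The sketch simply evaluates each candidate price and returns the one with the highest revenue subject to the capacity constraint.

```python
def demand_at(price):
    """Hypothetical demand model: seats requested falls as price rises."""
    return max(0, 300 - price)


def best_price(candidate_prices, demand, capacity):
    """Return (price, revenue) maximizing revenue, where seats sold
    cannot exceed the capacity constraint."""
    best = None
    for price in candidate_prices:
        seats_sold = min(demand(price), capacity)
        revenue = price * seats_sold
        if best is None or revenue > best[1]:
            best = (price, revenue)
    return best


candidate_prices = range(50, 301, 50)  # 50, 100, ..., 300 (assumed price grid)
recommended_price, max_revenue = best_price(candidate_prices, demand_at, capacity=180)

print(f"Recommended price: {recommended_price}, revenue: {max_revenue}")
```

The output of the model is a decision (the recommended price), which is what separates it from a descriptive summary or a forecast.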

How does the study of statistics relate to analytics? Most of the techniques in de- 
scriptive and predictive analytics come from probability and statistics. These include 
descriptive statistics, data visualization, probability and probability distributions, samp- 
ling, and predictive modeling, including regression analysis and time series forecasting. 
Each of these techniques is discussed in this text. The increased use of analytics for 
data-driven decision making makes it more important than ever for analysts and mana- 
gers to understand statistics and data analysis. Companies are increasingly seeking 
data-savvy managers who know how to use descriptive and predictive models to make 
data-driven decisions. 

At the beginning of this section, we mentioned the increased availability of data as one 
of the drivers of the interest in analytics. In the next section we discuss this explosion in 
available data and how it relates to the study of statistics. 


1.8 Big Data and Data Mining 


With the aid of credit-card readers, bar code scanners, point-of-sale terminals, and 
online data collection, most organizations obtain large amounts of data on a daily 
basis. And, even for a small local restaurant that uses touch screen monitors to enter 
orders and handle billing, the amount of data collected can be substantial. For large 
retail companies, the sheer volume of data collected is hard to conceptualize, and 
figuring out how to effectively use these data to improve profitability is a challenge. 
Mass retailers such as Walmart capture data on 20 to 30 million transactions every 
day, telecommunication companies such as France Telecom and AT&T generate over 
300 million call records per day, and Visa processes 6800 payment transactions per 
second or approximately 600 million transactions per day. 
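
The conversion from a per-second rate to a daily total can be checked with a few lines of Python. This is only an arithmetic check of the Visa figure quoted above; the variable names are our own.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds in one day

visa_per_second = 6800           # payment transactions per second, from the text
visa_per_day = visa_per_second * SECONDS_PER_DAY

print(f"{visa_per_day:,} transactions per day")  # prints: 587,520,000 transactions per day
```

The result, about 588 million, is consistent with the rounded figure of approximately 600 million transactions per day.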

In addition to the sheer volume and speed with which companies now collect data, 
more complicated types of data are now available and are proving to be of great value to 
businesses. Text data are collected by monitoring what is being said about a company’s 
products or services on social media such as Twitter. Audio data are collected from service 
calls (on a service call, you will often hear “this call may be monitored for quality con- 
trol”). Video data are collected by in-store video cameras to analyze shopping behavior. 
Analyzing information generated by these nontraditional sources is more complicated 
because of the complex process of transforming the information into data that can be 
analyzed. 

Larger and more complex data sets are now often referred to as big data. Although 
there does not seem to be a universally accepted definition of big data, many think of it 
as a set of data that cannot be managed, processed, or analyzed with commonly available 
software in a reasonable amount of time. Many data analysts define big data by referring to 
the four v’s of data: volume, velocity, variety, and veracity. Volume refers to the amount of 
available data (the typical unit of measure for data is now a terabyte, which is 10^12 bytes); 
velocity refers to the speed at which data are collected and processed; variety refers to the 
different data types; and veracity refers to the reliability of the data generated. 

The term data warehousing is used to refer to the process of capturing, storing, and 
maintaining the data. Computing power and data collection tools have reached the 
point where it is now feasible to store and retrieve extremely large quantities of data in 
seconds. Analysis of the data in the warehouse may result in decisions that will lead to 
new strategies and higher profits for the organization. For example, General Electric (GE) 
captures a large amount of data from sensors on its aircraft engines each time a plane takes 
off or lands. Capturing these data allows GE to offer an important service to its customers; 
GE monitors the engine performance and can alert its customer when service is needed or a 
problem is likely to occur. 

The subject of data mining deals with methods for developing useful decision-making 
information from large databases. Using a combination of procedures from statistics, math- 
ematics, and computer science, analysts “mine the data” in the warehouse to convert it into 
useful information, hence the name data mining. Dr. Kurt Thearling, a leading practitioner 
in the field, defines data mining as “the automated extraction of predictive information 
from (large) databases.” The two key words in Dr. Thearling’s definition are “automated” 
and “predictive.” Data mining systems that are the most effective use automated procedures 
to extract information from the data using only the most general or even vague queries by 
the user. And data mining software automates the process of uncovering hidden predictive 
information that in the past required hands-on analysis. 

The major applications of data mining have been made by companies with a strong 
consumer focus, such as retail businesses, financial organizations, and communication 
companies. Data mining has been successfully used to help retailers such as Amazon and 
Barnes & Noble determine one or more related products that customers who have already 
purchased a specific product are also likely to purchase. Then, when a customer logs on 
to the company’s website and purchases a product, the website uses pop-ups to alert the 
customer about additional products that the customer is likely to purchase. In another 
application, data mining may be used to identify customers who are likely to spend more than 
$20 on a particular shopping trip. These customers may then be identified as the ones to 
receive special e-mail or regular mail discount offers to encourage them to make their next 
shopping trip before the discount termination date. 

Statistical methods play an important role in data mining, both in terms of discovering 
relationships in the data and predicting future outcomes. However, a thorough coverage of 
data mining and the use of statistics in data mining is outside the scope of this text. 

Data mining is a technology that relies heavily on statistical methodology such as multiple 
regression, logistic regression, and correlation. But it takes a creative integration of all 
these methods and computer science technologies involving artificial intelligence and machine 
learning to make data mining effective. A substantial investment in time and money 
is required to implement commercial data mining software packages developed by firms 
such as Oracle, Teradata, and SAS. However, open-source software such as R and Python 
are also very popular tools for performing data mining. The statistical concepts introduced 
in this text will be helpful in understanding the statistical methodology used by data mining 
software packages and enable you to better understand the statistical information that 
is developed. 

Because statistical models play an important role in developing predictive models in 
data mining, many of the concerns that statisticians deal with in developing statistical mod- 
els are also applicable. For instance, a concern in any statistical study involves the issue of 
model reliability. Finding a statistical model that works well for a particular sample of data 
does not necessarily mean that it can be reliably applied to other data. One of the common 
statistical approaches to evaluating model reliability is to divide the sample data set into 
two parts: a training data set and a test data set. If the model developed using the training 
data is able to accurately predict values in the test data, we say that the model is reliable. 
One advantage that data mining has over classical statistics is that the enormous amount of 
data available allows the data mining software to partition the data set so that a model de- 
veloped for the training data set may be tested for reliability on other data. In this sense, the 
partitioning of the data set allows data mining to develop models and relationships and then 
quickly observe if they are repeatable and valid with new and different data. On the other 
hand, a warning for data mining applications is that with so much data available, there is a 
danger of overfitting the model to the point that misleading associations and cause/effect 
conclusions appear to exist. Careful interpretation of data mining results and additional 
testing will help avoid this pitfall. 
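
The training/test approach described above can be sketched in a few lines of Python. The data below are simulated and purely hypothetical (a straight-line relationship plus random noise), the 70/30 split is an arbitrary choice, and the helper names fit_line and mean_squared_error are our own; this is an illustration of the idea, not the procedure used by any particular data mining package.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample: y depends linearly on x, plus random noise.
data = [(x, 3.0 + 2.0 * x + random.gauss(0, 1.0)) for x in range(100)]

# Partition the sample into a training data set (70%) and a test data set (30%).
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

def fit_line(points):
    """Least-squares estimates of the intercept and slope."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(points, intercept, slope):
    """Average squared difference between actual and predicted y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)

# Develop the model using the training data only.
intercept, slope = fit_line(train)

# Compare prediction error on the training data and on the held-out test data.
train_mse = mean_squared_error(train, intercept, slope)
test_mse = mean_squared_error(test, intercept, slope)
print(f"training MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

If the error on the test data is close to the error on the training data, the model appears to be reliable. A test error that is much larger than the training error is a warning sign of overfitting.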


1.9 Ethical Guidelines for Statistical Practice 


Ethical behavior is something we should strive for in all that we do. Ethical issues arise in 
statistics because of the important role statistics plays in the collection, analysis, presenta- 
tion, and interpretation of data. In a statistical study, unethical behavior can take a variety 
of forms including improper sampling, inappropriate analysis of the data, development of 
misleading graphs, use of inappropriate summary statistics, and/or a biased interpretation 
of the statistical results. 

As you begin to do your own statistical work, we encourage you to be fair, thorough, 
objective, and neutral as you collect data, conduct analyses, make oral presentations, and 
present written reports containing information developed. As a consumer of statistics, 
you should also be aware of the possibility of unethical statistical behavior by others. 
When you see statistics in the media, it is a good idea to view the information with some 
skepticism, always being aware of the source as well as the purpose and objectivity of the 
statistics provided. 

The American Statistical Association, the nation’s leading professional organization 
for statistics and statisticians, developed the report “Ethical Guidelines for Statistical 
Practice”² to help statistical practitioners make and communicate ethical decisions and 
assist students in learning how to perform statistical work responsibly. The report contains 
52 guidelines organized into eight topic areas: Professional Integrity and Accountability; 
Integrity of Data and Methods; Responsibilities to Science/Public/Funder/Client; 
Responsibilities to Research Subjects; Responsibilities to Research Team Colleagues; 
Responsibilities to Other Statisticians or Statistics Practitioners; Responsibilities Regarding 
Allegations of Misconduct; and Responsibilities of Employers Including Organizations, 
Individuals, Attorneys, or Other Clients Employing Statistical Practitioners. 

²American Statistical Association, “Ethical Guidelines for Statistical Practice,” April 2018. 

One of the ethical guidelines in the Professional Integrity and Accountability area ad- 
dresses the issue of running multiple tests until a desired result is obtained. Let us consider 
an example. In Section 1.5 we discussed a statistical study conducted by Rogers Indus- 
tries involving a sample of 200 lithium batteries manufactured with a new solid-state 
technology. The average battery life for the sample, 18.84 hours, provided an estimate 
of the average lifetime for all lithium batteries produced with the new solid-state tech- 
nology. However, since Rogers selected a sample of batteries, it is reasonable to assume 
that another sample would have provided a different average battery life. 

Suppose Rogers’s management had hoped the sample results would enable them to 
claim that the average time until recharge for the new batteries was 20 hours or more. Sup- 
pose further that Rogers’s management decides to continue the study by manufacturing and 
testing repeated samples of 200 batteries with the new solid-state technology until a sample 
mean of 20 hours or more is obtained. If the study is repeated enough times, a sample 
may eventually be obtained—by chance alone—that would provide the desired result and 
enable Rogers to make such a claim. In this case, consumers would be misled into thinking 
the new product is better than it actually is. Clearly, this type of behavior is unethical and 
represents a gross misuse of statistics in practice. 
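
A small simulation shows why repeating a study until the desired result appears is misleading. In the Python sketch below, the sample size of 200 and the 20-hour target come from the Rogers example. Everything about the population is an assumption made only for illustration: we suppose that battery lifetimes follow a gamma distribution whose true mean is 18.84 hours, so the true mean is below the target by construction.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 200   # batteries per sample, as in the Rogers study
TARGET = 20.0       # sample mean that management hopes to report

# Assumed population (not from the text): gamma-distributed lifetimes with
# mean 4 * 4.71 = 18.84 hours and a standard deviation of about 9.4 hours.
SHAPE, SCALE = 4, 4.71

def sample_mean():
    """Mean lifetime of one new sample of SAMPLE_SIZE simulated batteries."""
    lifetimes = [random.gammavariate(SHAPE, SCALE) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(lifetimes)

def repeat_until_target(max_samples=10_000):
    """Draw samples until one reaches TARGET; return (samples drawn, mean)."""
    for samples_drawn in range(1, max_samples + 1):
        mean = sample_mean()
        if mean >= TARGET:
            return samples_drawn, mean
    return None  # the target was never reached within max_samples

result = repeat_until_target()
if result is not None:
    samples_drawn, lucky_mean = result
    print(f"Sample {samples_drawn} had a mean of {lucky_mean:.2f} hours")
```

Under these assumptions a sample mean of 20 hours or more eventually turns up by chance alone, even though the population mean is only 18.84 hours. Reporting that one sample without mentioning the others would misrepresent the product.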

Several ethical guidelines in the Integrity of Data and Methods area deal with issues 
involving the handling of data. For instance, a statistician must account for all data con- 
sidered in a study and explain the sample(s) actually used. In the Rogers Industries study 
the average battery life for the 200 batteries in the original sample is 18.84 hours; this is 
less than the 20 hours or more that management hoped to obtain. Suppose now that after 
reviewing the results showing an 18.84-hour average battery life, Rogers discards all the 
observations with 18 or fewer hours until recharge, allegedly because these batteries contain 
imperfections caused by startup problems in the manufacturing process. After discarding 
these batteries, the average lifetime for the remaining batteries in the sample turns out to be 
22 hours. Would you be suspicious of Rogers’s claim that the battery life for its new solid- 
state batteries is 22 hours? 

If the Rogers batteries showing 18 or fewer hours until recharge were discarded simply to 
provide an average lifetime of 22 hours, there is no question that discarding the batteries 
with 18 or fewer hours until recharge is unethical. But, even if the discarded batteries con- 
tain imperfections due to startup problems in the manufacturing process—and, as a result, 
should not have been included in the analysis—the statistician who conducted the study 
must account for all the data that were considered and explain how the sample actually 
used was obtained. To do otherwise is potentially misleading and would constitute uneth- 
ical behavior on the part of both the company and the statistician. 
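
The effect of selectively discarding observations can be seen in the short Python sketch below. The 200 lifetimes are simulated from an assumed gamma distribution and are not the Rogers data; only the rule of discarding observations of 18 hours or fewer comes from the example.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical lifetimes (hours) for a sample of 200 batteries; assumed data.
lifetimes = [random.gammavariate(4, 4.71) for _ in range(200)]

# Selectively discard every observation of 18 hours or fewer.
kept = [hours for hours in lifetimes if hours > 18]
discarded = len(lifetimes) - len(kept)

print(f"All {len(lifetimes)} batteries: mean = {statistics.mean(lifetimes):.2f} hours")
print(f"After discarding {discarded}: mean = {statistics.mean(kept):.2f} hours")
```

Because only low values are removed, the mean of the remaining observations must be higher than the mean of the full sample. An honest report states both figures, the number of observations discarded, and the reason they were discarded.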

A guideline in the Professional Integrity and Accountability section of the American 
Statistical Association report states that statistical practitioners should avoid any tendency 
to slant statistical work toward predetermined outcomes. This type of unethical practice 
is often observed when unrepresentative samples are used to make claims. For instance, 
in many areas of the country smoking is not permitted in restaurants. Suppose, however, 
a lobbyist for the tobacco industry interviews people in restaurants where smoking is 
permitted in order to estimate the percentage of people who are in favor of allowing smoking 
in restaurants. The sample results show that 90% of the people interviewed are in favor of 
allowing smoking in restaurants. Based upon these sample results, the lobbyist claims that 
90% of all people who eat in restaurants are in favor of permitting smoking in restaurants. 
In this case we would argue that only sampling persons eating in restaurants that allow 
smoking has biased the results. If only the final results of such a study are reported, readers 
unfamiliar with the details of the study (i.e., that the sample was collected only in 
restaurants allowing smoking) can be misled. 

The scope of the American Statistical Association’s report is broad and includes ethical 
guidelines that are appropriate not only for a statistician, but also for consumers of statistical 
information. We encourage you to read the report to obtain a better perspective of ethical 
issues as you continue your study of statistics and to gain the background for determining 
how to ensure that ethical standards are met when you start to use statistics in practice. 


SUMMARY 




Statistics is the art and science of collecting, analyzing, presenting, and interpreting data. 
Nearly every college student majoring in business or economics is required to take a course 
in statistics. We began the chapter by describing typical statistical applications for business 
and economics. 

Data consist of the facts and figures that are collected and analyzed. The four scales of 
measurement used to obtain data on a particular variable are nominal, ordinal, interval, and 
ratio. The scale of measurement for a variable is nominal when the data are labels or names 
used to identify an attribute of an element. The scale is ordinal if the data demonstrate the 
properties of nominal data and the order or rank of the data is meaningful. The scale is in- 
terval if the data demonstrate the properties of ordinal data and the interval between values 
is expressed in terms of a fixed unit of measure. Finally, the scale of measurement is ratio 
if the data show all the properties of interval data and the ratio of two values is meaningful. 

For purposes of statistical analysis, data can be classified as categorical or quantitative. 
Categorical data use labels or names to identify an attribute of each element. Categor- 
ical data use either the nominal or ordinal scale of measurement and may be nonnumeric 
or numeric. Quantitative data are numeric values that indicate how much or how many. 
Quantitative data use either the interval or ratio scale of measurement. Ordinary arithmetic 
operations are meaningful only if the data are quantitative. Therefore, statistical computa- 
tions used for quantitative data are not always appropriate for categorical data. 
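
The distinction between the two types of data determines which summaries make sense. The Python sketch below uses a small set of hypothetical records (they are not taken from any table in this chapter) to show a percentage summary for a categorical variable and an average for a quantitative variable.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (operating system, cost in dollars).
records = [("Windows", 600), ("Android", 300), ("iOS", 500), ("Android", 400)]

# Categorical variable: count each label and convert the counts to percentages.
os_counts = Counter(os_name for os_name, _ in records)
os_percent = {name: 100 * count / len(records) for name, count in os_counts.items()}

# Quantitative variable: ordinary arithmetic such as an average is meaningful.
average_cost = mean([cost for _, cost in records])

print(os_percent)    # prints: {'Windows': 25.0, 'Android': 50.0, 'iOS': 25.0}
print(average_cost)  # prints: 450
```

Averaging the operating system labels would be meaningless, which is why counts and percentages are the appropriate summaries for categorical data.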

In Sections 1.4 and 1.5 we introduced the topics of descriptive statistics and statistical 
inference. Descriptive statistics are the tabular, graphical, and numerical methods used 
to summarize data. The process of statistical inference uses data obtained from a sample 
to make estimates or test hypotheses about the characteristics of a population. The last four 
sections of the chapter provide information on the role of computers in statistical analysis, 
an introduction to the relatively new fields of analytics, data mining, and big data, and a 
summary of ethical guidelines for statistical practice. 


GLOSSARY 
Analytics The scientific process of transforming data into insights for making better de- 
cisions. 

Big data A set of data that cannot be managed, processed, or analyzed with commonly 
available software in a reasonable amount of time. Big data are characterized by great 
volume (a large amount of data), high velocity (fast collection and processing), or wide 
variety (could include nontraditional data such as video, audio, and text). 

Categorical data Labels or names used to identify an attribute of each element. Categor- 
ical data use either the nominal or ordinal scale of measurement and may be nonnumeric or 
numeric. 

Categorical variable A variable with categorical data. 

Census A survey to collect data on the entire population. 

Cross-sectional data Data collected at the same or approximately the same point in time. 

Data The facts and figures collected, analyzed, and summarized for presentation and 
interpretation. 

Data mining The process of using procedures from statistics and computer science to 
extract useful information from extremely large databases. 

Data set All the data collected in a particular study. 

Descriptive analytics Analytical techniques that describe what has happened in the past. 

Descriptive statistics Tabular, graphical, and numerical summaries of data. 

Elements The entities on which data are collected. 

Interval scale The scale of measurement for a variable if the data demonstrate the proper- 
ties of ordinal data and the interval between values is expressed in terms of a fixed unit of 
measure. Interval data are always numeric. 

Nominal scale The scale of measurement for a variable when the data are labels or names 
used to identify an attribute of an element. Nominal data may be nonnumeric or numeric. 

Observation The set of measurements obtained for a particular element. 

Ordinal scale The scale of measurement for a variable if the data exhibit the properties of 
nominal data and the order or rank of the data is meaningful. Ordinal data may be nonnu- 
meric or numeric. 

Population The set of all elements of interest in a particular study. 

Predictive analytics Analytical techniques that use models constructed from past data to 
predict the future or assess the impact of one variable on another. 

Prescriptive analytics Analytical techniques that yield a course of action. 

Quantitative data Numeric values that indicate how much or how many of something. 
Quantitative data are obtained using either the interval or ratio scale of measurement. 

Quantitative variable A variable with quantitative data. 

Ratio scale The scale of measurement for a variable if the data demonstrate all the properties 
of interval data and the ratio of two values is meaningful. Ratio data are always numeric. 

Sample A subset of the population. 

Sample survey A survey to collect data on a sample. 

Statistical inference The process of using data obtained from a sample to make estimates 
or test hypotheses about the characteristics of a population. 

Statistics The art and science of collecting, analyzing, presenting, and interpreting data. 

Time series data Data collected over several time periods. 

Variable A characteristic of interest for the elements. 


SUPPLEMENTARY EXERCISES 


1. Discuss the differences between statistics as numerical facts and statistics as a 
discipline or field of study. 

2. Comparing Tablet Computers. Tablet PC Comparison provides a wide variety 
of information about tablet computers. Their website enables consumers to easily 
compare different tablets using factors such as cost, type of operating system, display 
size, battery life, and CPU manufacturer. A sample of 10 tablet computers is shown in 
Table 1.6 (Tablet PC Comparison website). 
a. How many elements are in this data set? 
b. How many variables are in this data set? 
c. Which variables are categorical and which variables are quantitative? 
d. What type of measurement scale is used for each of the variables? 

3. Tablet PCs: Cost, CPU, and Operating System. Refer to Table 1.6. 
a. What is the average cost for the tablets? 
b. Compare the average cost of tablets with a Windows operating system to the 
average cost of tablets with an Android operating system. 
c. What percentage of tablets use a CPU manufactured by TI OMAP? 
d. What percentage of tablets use an Android operating system? 

TABLE 1.6 Product Information for 10 Tablet Computers 

                                    Operating  Display Size  Battery Life  CPU 
Tablet                    Cost ($)  System     (inches)      (hours)       Manufacturer 
Acer Iconia W510          599       Windows    10.1          8.5           Intel 
Amazon Kindle Fire HD     299       Android    8.9           9             TI OMAP 
Apple iPad 4              499       iOS        9.7           11            Apple 
HP Envy X2                860       Windows    11.6          8             Intel 
Lenovo ThinkPad Tablet    668       Windows    10.1          10.5          Intel 
Microsoft Surface Pro     899       Windows    10.6          4             Intel 
Motorola Droid XYboard    530       Android    10.1          9             TI OMAP 
Samsung Ativ Smart PC     590       Windows    11.6          7             Intel 
Samsung Galaxy Tab        525       Android    10.1          10            Nvidia 
Sony Tablet S             360       Android    9.4           8             Nvidia 


4. Comparing Phones. Table 1.7 shows data for eight phones (Consumer Reports). The 
Overall Score, a measure of the overall quality for the phone, ranges from 0 to 100. 
Voice Quality has possible ratings of poor, fair, good, very good, and excellent. Talk 
Time is the manufacturer’s claim of how long the phone can be used when it is 
fully charged. 
a. How many elements are in this data set? 
b. For the variables Price, Overall Score, Voice Quality, and Talk Time, which 
variables are categorical and which variables are quantitative? 
c. What scale of measurement is used for each variable? 
5. Summarizing Phone Data. Refer to the data set in Table 1.7. 
a. What is the average price for the phones? 
b. What is the average talk time for the phones? 
c. What percentage of the phones have a voice quality of excellent? 

6. New Automobile Owners Survey. J.D. Power and Associates surveys new automobile 
owners to learn about the quality of recently purchased vehicles. The following 
questions were asked in the J.D. Power Initial Quality Survey. 

a. Did you purchase or lease the vehicle? 

b. What price did you pay? 

c. What is the overall attractiveness of your vehicle’s exterior? (Unacceptable, 
Average, Outstanding, or Truly Exceptional) 

d. What is your average number of miles per gallon? 

e. What is your overall rating of your new vehicle? (1- to 10-point scale with 
1 being Unacceptable and 10 being Truly Exceptional) 


Comment on whether each question provides categorical or quantitative data. 


TABLE 1.7 Data for Eight Phones 

Brand      Model    Price ($)  Overall Score  Voice Quality  Talk Time (hours) 
AT&T       CL84100  60         73             Excellent      7 
AT&T       TL92271  80         70             Very Good      7 
Panasonic  4773B    100        78             Very Good      13 
Panasonic  6592T    70         72             Very Good      13 
Uniden     D2997    45         70             Very Good      10 
Uniden     D1788    80         73             Very Good      7 
Vtech      DS6521   60         72             Excellent      7 
Vtech      CS6649   50         72             Very Good      7 

7. Airline Customer Satisfaction. Many service companies collect data via a follow-up 
survey of their customers. For example, to ascertain customer sentiment, Delta 
Air Lines sends an e-mail to customers immediately following a flight. Among other 
questions, Delta asks: 


How likely are you to recommend Delta Air Lines to others? 


The possible responses are: 


Definitely Will    Probably Will    May or May Not    Probably Will Not    Definitely Will Not 
       O                 O                 O                  O                     O 


a. Are the data collected by Delta in this example quantitative or categorical? 
b. What measurement scale is used? 

8. Readership Poll. The Tennessean, an online newspaper located in Nashville, 
Tennessee, conducts a daily poll to obtain reader opinions on a variety of current 
issues. In a recent poll, 762 readers responded to the following question: “If a 
constitutional amendment to ban a state income tax is placed on the ballot in 
Tennessee, would you want it to pass?” Possible responses were Yes, No, or Not Sure 
(The Tennessean website). 

a. What was the sample size for this poll? 

b. Are the data categorical or quantitative? 

c. Would it make more sense to use averages or percentages as a summary of the data 
for this question? 

d. Of the respondents, 67% said Yes, they would want it to pass. How many individu- 
als provided this response? 

9. College-Educated Workers. Based on data from the U.S. Census Bureau, a Pew 
Research study showed that the percentage of employed individuals ages 25-29 who 
are college educated is at an all-time high. The study showed that the percentage of 
employed individuals aged 25-29 with at least a bachelor’s degree in 2016 was 40%. 
In the year 2000, this percentage was 32%, in 1985 it was 25%, and in 1964 it was 
only 16% (Pew Research website). 

a. What is the population being studied in each of the four years in which Pew has 
data? 

b. What question was posed to each respondent? 

c. Do responses to the question provide categorical or quantitative data? 

10. Driving with Cell Phones. The Bureau of Transportation Statistics Omnibus 
Household Survey is conducted annually and serves as an information source for 
the U.S. Department of Transportation. In one part of the survey the person being 
interviewed was asked to respond to the following statement: “Drivers of motor 
vehicles should be allowed to talk on a hand-held cell phone while driving.” Possible 
responses were strongly agree, somewhat agree, somewhat disagree, and strongly 
disagree. Forty-four respondents said that they strongly agree with this statement, 
130 said that they somewhat agree, 165 said they somewhat disagree, and 741 said 
they strongly disagree with this statement. 

a. Do the responses for this statement provide categorical or quantitative data? 

b. Would it make more sense to use averages or percentages as a summary of the 
responses for this statement? 

c. What percentage of respondents strongly agree with allowing drivers of motor 
vehicles to talk on a hand-held cell phone while driving? 

d. Do the results indicate general support for or against allowing drivers of motor 
vehicles to talk on a hand-held cell phone while driving? 

FIGURE 1.10 Histogram of Survey Results on Driverless Cars (vertical axis: Percentage of Responses) 


11. Driverless Cars Expected Soon. A Gallup Poll utilizing a random sample of 1,503 
adults ages 18 or older was conducted in April 2018. The survey indicated a majority 
of Americans (53%) say driverless cars will be common in the next 10 years (Gallup, 
https://news.gallup.com/poll/234152/americans-expect-driverless-cars-common-next-decade.aspx). 
The question asked was: 


Thinking about fully automated, “driverless cars,” cars that use technology to drive 
and do not need a human driver, based on what you have heard or read, how soon 
do you think driverless cars will be commonly used in the [United States]? 


Figure 1.10 shows a summary of results of the survey in a histogram indicating the 
percentage of the total responses in different time intervals. 
a. Are the responses to the survey question quantitative or categorical? 
b. How many of the respondents said that they expect driverless cars to be common in 
the next 10 years? 

c. How many respondents answered in the range 16-20 years? 

12. Hawaii Visitors Poll. The Hawaii Visitors Bureau collects data on visitors to Hawaii. 
The following questions were among 16 asked in a questionnaire handed out to 
passengers during incoming airline flights. 


• This trip to Hawaii is my: 1st, 2nd, 3rd, 4th, etc. 
• The primary reason for this trip is: (10 categories, including vacation, convention, 
honeymoon) 
• Where I plan to stay: (11 categories, including hotel, apartment, relatives, 
camping) 
• Total days in Hawaii 
a. What is the population being studied? 
b. Is the use of a questionnaire a good way to reach the population of passengers on 
incoming airline flights? 
c. Comment on each of the four questions in terms of whether it will provide categor- 
ical or quantitative data. 

FIGURE 1.11 Facebook Worldwide Advertising Revenue 

13. Facebook Advertising Revenue. Figure 1.11 provides a bar chart showing the annual 
advertising revenue for Facebook from 2010 to 2017 (Facebook Annual Reports). 
a. What is the variable of interest? 
b. Are the data categorical or quantitative? 
c. Are the data time series or cross-sectional? 
d. Comment on the trend in Facebook’s annual advertising revenue over time. 

14. Rental Car Fleet Size. The following data show the number of rental cars in service 
for three rental car companies: Hertz, Avis, and Dollar, over a four-year period. 


Cars in Service (1000s) 

Company   Year 1   Year 2   Year 3   Year 4 
Hertz     327      311      286      290 
Dollar    167      140      106      108 
Avis      204      220      300      270 


a. Construct a time series graph for the years 1 to 4 showing the number of rental cars 
in service for each company. Show the time series for all three companies on the 
same graph. 

b. Comment on who appears to be the market share leader and how the market shares 
are changing over time. 

c. Construct a bar chart showing rental cars in service for Year 4. Is this chart based on 
cross-sectional or time series data? 

15. Jewelry Sales. The U.S. Census Bureau tracks sales per month for various products 
and services through its Monthly Retail Trade Survey. Figure 1.12 shows monthly 
jewelry sales in millions of dollars for 2016. 

a. Are the data quantitative or categorical? 

b. Are the data cross-sectional or time series? 

c. Which four months have the highest sales? 

d. Why do you think the answers to part c might be the highest four months? 

FIGURE 1.12 Estimated Monthly Jewelry Sales in the United States for 2016 

Source: The U.S. Census Bureau tracks sales per month for various products and services through its Monthly 
Retail Trade Survey (https://www.census.gov/retail/mrts/historic_releases.html). 


16. Athletic Shoe Sales. Skechers U.S.A., Inc., is a performance footwear company 
headquartered in Manhattan Beach, California. The sales revenues for Skechers over a 
four-year period are as follows: 


Year     Sales ($ Billion) 
Year 1   2.30 
Year 2   3.15 
Year 3   3.56 
Year 4   4.16 


a. Are these cross-sectional or time-series data? 

b. Construct a bar graph similar to Figure 1.12. 

c. What can you say about how Skechers’ sales are changing over these four years? 

17. Deciding on a Salary Increase. A manager of a large corporation recommends a 
$10,000 raise be given to keep a valued subordinate from moving to another company. 
What internal and external sources of data might be used to decide whether such a 
salary increase is appropriate? 

18. Tax Survey. A random telephone survey of 1021 adults (aged 18 and older) was 
conducted by Opinion Research Corporation on behalf of CompleteTax, an online tax 
preparation and e-filing service. The survey results showed that 684 of those surveyed 
planned to file their taxes electronically. 

a. Develop a descriptive statistic that can be used to estimate the percentage of all 
taxpayers who file electronically. 

b. The survey reported that the most frequently used method for preparing the tax 
return was to hire an accountant or professional tax preparer. If 60% of the people 
surveyed had their tax return prepared this way, how many people used an accountant
or professional tax preparer?

c. Other methods that the person filing the return often used include manual preparation,
use of an online tax service, and use of a software tax program. Would
the data for the method for preparing the tax return be considered categorical or
quantitative?

19. Magazine Subscriber Survey. A Bloomberg Businessweek North American 
subscriber study collected data from a sample of 2861 subscribers. Fifty-nine percent 
of the respondents indicated an annual income of $75,000 or more, and 50% reported 
having an American Express credit card. 

a. What is the population of interest in this study? 

b. Is annual income a categorical or quantitative variable?

c. Is ownership of an American Express card a categorical or quantitative variable?

d. Does this study involve cross-sectional or time series data?

e. Describe any statistical inferences Bloomberg Businessweek might make on the
basis of the survey.

20. Investment Manager Survey. A survey of 131 investment managers in Barron’s Big 
Money poll revealed the following: 


• Forty-three percent of managers classified themselves as bullish or very bullish on
the stock market.

• The average expected return over the next 12 months for equities was 11.2%.

• Twenty-one percent selected health care as the sector most likely to lead the market
in the next 12 months.

• When asked to estimate how long it would take for technology and telecom stocks
to resume sustainable growth, the managers’ average response was 2.5 years.

a. Cite two descriptive statistics. 

b. Make an inference about the population of all investment managers concerning the 
average return expected on equities over the next 12 months. 

c. Make an inference about the length of time it will take for technology and telecom 
stocks to resume sustainable growth. 

21. Cancer Research. A seven-year medical research study reported that women whose 
mothers took the drug DES during pregnancy were twice as likely to develop tissue 
abnormalities that might lead to cancer as were women whose mothers did not take 
the drug. 

a. This study compared two populations. What were the populations? 

b. Do you suppose the data were obtained in a survey or an experiment? 

c. For the population of women whose mothers took the drug DES during pregnancy,
a sample of 3,980 women showed that 63 developed tissue abnormalities
that might lead to cancer. Provide a descriptive statistic that could be used
to estimate the number of women out of 1,000 in this population who have
tissue abnormalities.

d. For the population of women whose mothers did not take the drug DES during 
pregnancy, what is the estimate of the number of women out of 1,000 who would 
be expected to have tissue abnormalities? 

e. Medical studies often use a relatively large sample (in this case, 3,980). Why? 

22. Why People Move. A survey conducted by Better Homes and Gardens Real Estate 
LLC showed that one in five U.S. homeowners have either moved from their home 
or would like to move because their neighborhood or community isn’t ideal for their 
lifestyle (Better Homes and Gardens Real Estate website). The top lifestyle priorities 
of respondents when searching for their next home include ease of commuting by 
car, access to health and safety services, family-friendly neighborhood, availability 
of retail stores, access to cultural activities, public transportation access, and nightlife 
and restaurant access. Suppose a real estate agency in Denver, Colorado, hired you to 
conduct a similar study to determine the top lifestyle priorities for clients that currently 
have a home listed for sale with the agency or have hired the agency to help them 
locate a new home. 

a. What is the population for the survey you will be conducting?
b. How would you collect the data for this study? 

23. Teenage Cell Phone Use. Pew Research Center is a nonpartisan polling organization 
that provides information about issues, attitudes, and trends shaping America. In a 
2015 poll, Pew researchers found that 73% of teens aged 13-17 have a smartphone,
15% have a basic phone, and 12% have no phone. The study also asked the respondents
how they communicated with their closest friend. Of those with a smartphone,
58% responded texting, 17% social media, and 10% phone calls. Of those with no
smartphone, 25% responded texting, 29% social media, and 21% phone calls (Pew
Research Center website).

a. One statistic (58%) concerned the use of texting to contact his/her closest friend, if 
the teen owns a smartphone. To what population is that applicable? 

b. Another statistic (25%) concerned the use of texting by those who do not own a 
smartphone. To what population is that applicable? 

c. Do you think the Pew researchers conducted a census or a sample survey to obtain 
their results? Why? 

24. Midterm Grades. A sample of midterm grades for five students showed the following 
results: 72, 65, 82, 90, 76. Which of the following statements are correct, and which 
should be challenged as being too generalized? 

a. The average midterm grade for the sample of five students is 77.

b. The average midterm grade for all students who took the exam is 77.

c. An estimate of the average midterm grade for all students who took the exam is 77.

d. More than half of the students who take this exam will score between 70 and 85.

e. If five other students are included in the sample, their grades will be between
65 and 90.

25. Comparing Compact SUVs. Consumer Reports evaluates products for consumers. 
The file CompactSUV contains the data shown in Table 1.8 for 15 compact sports 
utility vehicles (SUVs) from the 2018 model line (Consumer Reports website): 


Make          Model             Overall   Recommended   Owner          Overall Miles   Acceleration
                                Score                   Satisfaction   per Gallon      (0-60) Sec
Subaru        Forester          84        Yes           +              26              8.7
Honda         CRV               83        Yes           FE             27              8.6
Toyota        Rav4              81        Yes           Hi             24              ors)
Nissan        Rogue             73        Yes           T              24              ss)
Mazda         CX-5              71        Yes           aa             24              8.6
Kia           Sportage          71        Yes           ar             23              9.6
Ford          Escape            69        Yes           0              23              10.1
Volkswagen    Tiguan Limited    67        No            0              21              8.5
Volkswagen    Tiguan            65        No            t              25              10.3
Mitsubishi    Outlander         63        No            0              24              10.0
Chevrolet     Equinox           63        No            0              31              10.1
Hyundai       Tucson            57        No            0              26              8.4
GMC           Terrain           57        No            0              22              7.2
Jeep          Cherokee          55        No            i              22              10.9
Jeep          Compass           50        No            0              24              9.8

Make: manufacturer
Model: name of the model
Overall score: awarded based on a variety of measures, including those in this data set
Recommended: Consumer Reports recommends the vehicle or not
Owner satisfaction: satisfaction on a five-point scale based on the percentage of
owners who would purchase the vehicle again (--, -, 0, +, ++)
Overall miles per gallon: miles per gallon achieved in a 150-mile test trip
Acceleration (0-60 sec): time in seconds it takes vehicle to reach 60 miles per
hour from a standstill with the engine idling

a. How many variables are in the data set?

b. Which of the variables are categorical, and which are quantitative?

c. What percentage of these 15 vehicles are recommended?

d. What is the average of the overall miles per gallon across all 15 vehicles?

e. For owner satisfaction, construct a bar chart similar to Figure 1.4.

f. Show the frequency distribution for acceleration using the following intervals:
7.0-7.9, 8.0-8.9, 9.0-9.9, and 10.0-10.9. Construct a histogram similar
to Figure 1.5.

APPENDIX
APPENDIX 2.1: CREATING TABULAR AND GRAPHICAL PRESENTATIONS WITH R 
(MINDTAP READER) 


STATISTICS IN PRACTICE 


Colgate-Palmolive Company* 
NEW YORK, NEW YORK 


The Colgate-Palmolive Company started as a small soap and candle shop in New York City
in 1806. Today, Colgate-Palmolive employs more than 34,000 people working in more than
200 countries and territories around the world. Although best known for its brand names
of Colgate, Palmolive, and Fabuloso, the company also markets Irish Spring, Hill's Science
Diet, Softsoap, and Ajax products.

The Colgate-Palmolive Company uses statistics in its quality assurance program for home
laundry detergent products. One concern is customer satisfaction with the quantity of
detergent in a carton. Every carton in each size category is filled with the same amount
of detergent by weight, but the volume of detergent is affected by the density of the
detergent powder. For instance, if the powder density is on the heavy side, a smaller
volume of detergent is needed to reach the carton’s specified weight. As a result, the
carton may appear to be underfilled when opened by the consumer.

To control the problem of heavy detergent powder, limits are placed on the acceptable
range of powder density. Statistical samples are taken periodically, and the density of
each powder sample is measured. Data summaries are then provided for operating personnel
so that corrective action can be taken if necessary to keep the density within the desired
quality specifications.

A frequency distribution for the densities of 150 samples taken over a one-week period
and a histogram are shown in the accompanying table and figure. Density levels above .40
are unacceptably high. The frequency distribution and histogram show that the operation
is meeting its quality guidelines with all of the densities less than or equal to .40.
Managers viewing these statistical summaries would be pleased with the quality of the
detergent production process.

In this chapter, you will learn about tabular and graphical methods of descriptive
statistics such as frequency distributions, bar charts, histograms, stem-and-leaf
displays, crosstabulations, and others. The goal of these methods is to summarize data
so that the data can be easily understood and interpreted.

[Photo: The Colgate-Palmolive Company uses statistical summaries to help maintain the
quality of its products. Kurt Brady/Alamy Stock Photo]


Frequency Distribution of Density Data

Density    Frequency
.29-.30    30
.31-.32    75
.33-.34    32
.35-.36    9
.37-.38    3
.39-.40    1
Total      150


[Figure: Histogram of Density Data. Horizontal axis: Density; vertical axis: Frequency.
Annotation: Less than 1% of samples near the undesirable .40 level.]


*The authors are indebted to William R. Fowle, Manager of Quality Assurance,
Colgate-Palmolive Company, for providing this Statistics in Practice.

As indicated in Chapter 1, data can be classified as either categorical or quantitative.
Categorical data use labels or names to identify categories of like items, and
quantitative data are numerical values that indicate how much or how many. This 
chapter introduces the use of tabular and graphical displays for summarizing both 
categorical and quantitative data. Tabular and graphical displays can be found in annual 
reports, newspaper articles, and research studies. Everyone is exposed to these types of 
presentations. Hence, it is important to understand how they are constructed and how 
they should be interpreted. 

We begin with a discussion of the use of tabular and graphical displays to summarize
the data for a single variable. This is followed by a discussion of the use of tabular
and graphical displays to summarize the data for two variables in a way that reveals the 
relationship between the two variables. Data visualization is a term often used to describe 
the use of graphical displays to summarize and present information about a data set. The 
last section of this chapter provides an introduction to data visualization and provides 
guidelines for creating effective graphical displays. 


2.1 Summarizing Data for a Categorical Variable 
Frequency Distribution 


We begin the discussion of how tabular and graphical displays can be used to summarize 
categorical data with the definition of a frequency distribution. 


FREQUENCY DISTRIBUTION 


A frequency distribution is a tabular summary of data showing the number (frequency) of 
observations in each of several nonoverlapping categories or classes. 


Let us use the following example to demonstrate the construction and interpretation of 
a frequency distribution for categorical data. Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and 
Sprite are five popular soft drinks. Assume that the data in Table 2.1 show the soft drink 
selected in a sample of 50 soft drink purchases. 

To develop a frequency distribution for these data, we count the number of times each 
soft drink appears in Table 2.1. Coca-Cola appears 19 times, Diet Coke appears 8 times, 
Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times. These 
counts are summarized in the frequency distribution in Table 2.2. 

This frequency distribution provides a summary of how the 50 soft drink purchases
are distributed across the five soft drinks. This summary offers more insight than the
original data shown in Table 2.1. Viewing the frequency distribution, we see that Coca-Cola
is the leader, Pepsi is second, Diet Coke is third, and Sprite and Dr. Pepper are tied
for fourth. The frequency distribution summarizes information about the popularity of
the five soft drinks.


TABLE 2.1 Data from a Sample of 50 Soft Drink Purchases

DATA file: SoftDrink

Coca-Cola     Coca-Cola     Coca-Cola     Sprite        Coca-Cola
Diet Coke     Dr. Pepper    Diet Coke     Dr. Pepper    Diet Coke
Pepsi         Sprite        Coca-Cola     Pepsi         Pepsi
Diet Coke     Coca-Cola     Sprite        Diet Coke     Pepsi
Coca-Cola     Diet Coke     Pepsi         Pepsi         Pepsi
Coca-Cola     Coca-Cola     Coca-Cola     Coca-Cola     Pepsi
Dr. Pepper    Coca-Cola     Coca-Cola     Coca-Cola     Coca-Cola
Diet Coke     Sprite        Coca-Cola     Coca-Cola     Dr. Pepper
Pepsi         Coca-Cola     Pepsi         Pepsi         Pepsi
Pepsi         Diet Coke     Coca-Cola     Dr. Pepper    Sprite


TABLE 2.2 Frequency Distribution of Soft Drink Purchases

Soft Drink    Frequency
Coca-Cola     19
Diet Coke     8
Dr. Pepper    5
Pepsi         13
Sprite        5
Total         50
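
For readers who want to check the counts outside of Excel, the tallying step can be sketched in Python. This is only a minimal illustration, not part of the text's Excel workflow: the variable names (`purchases`, `frequency`) and the short constants are assumptions introduced here, and the data are typed in from Table 2.1.

```python
from collections import Counter

# Short names for the five categories (illustrative assumption).
CC, DC, DP, PE, SP = "Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"

# The 50 purchases from Table 2.1, entered row by row.
purchases = [
    CC, CC, CC, SP, CC,
    DC, DP, DC, DP, DC,
    PE, SP, CC, PE, PE,
    DC, CC, SP, DC, PE,
    CC, DC, PE, PE, PE,
    CC, CC, CC, CC, PE,
    DP, CC, CC, CC, CC,
    DC, SP, CC, CC, DP,
    PE, CC, PE, PE, PE,
    PE, DC, CC, DP, SP,
]

# Count how many times each category appears.
frequency = Counter(purchases)

# Print the frequency distribution in alphabetical order, then the total.
for drink in sorted(frequency):
    print(f"{drink:<12}{frequency[drink]:>4}")
print(f"{'Total':<12}{sum(frequency.values()):>4}")
```

Because the categories are nonoverlapping, every purchase is counted exactly once, so the frequencies must sum to the sample size of 50.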


Relative Frequency and Percent Frequency Distributions 


A frequency distribution shows the number (frequency) of observations in each of several 
nonoverlapping classes. However, we are often interested in the proportion, or percent- 
age, of observations in each class. The relative frequency of a class equals the fraction or 
proportion of observations belonging to a class. For a data set with n observations, the 
relative frequency of each class can be determined as follows: 


RELATIVE FREQUENCY

$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n} \qquad (2.1)$$


The percent frequency of a class is the relative frequency multiplied by 100. 

A relative frequency distribution gives a tabular summary of data showing the relative 
frequency for each class. A percent frequency distribution summarizes the percent frequency 
of the data for each class. Table 2.3 shows a relative frequency distribution and a percent 
frequency distribution for the soft drink data. In Table 2.3 we see that the relative frequency for 
Coca-Cola is 19/50 = .38, the relative frequency for Diet Coke is 8/50 = .16, and so on. From 
the percent frequency distribution, we see that 38% of the purchases were Coca-Cola, 16% of 
the purchases were Diet Coke, and so on. We can also note that 38% + 26% + 16% = 80% 
of the purchases were for the top three soft drinks. 


TABLE 2.3 Relative Frequency and Percent Frequency Distributions of Soft
Drink Purchases

Soft Drink    Relative Frequency    Percent Frequency
Coca-Cola     .38                   38
Diet Coke     .16                   16
Dr. Pepper    .10                   10
Pepsi         .26                   26
Sprite        .10                   10
Total         1.00                  100
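
Equation (2.1) and the percent frequency calculation can also be sketched in a few lines of Python. The snippet is a hedged illustration only: the dictionary names (`frequency`, `relative`, `percent`) are assumptions, and the counts are the ones reported in the frequency distribution of soft drink purchases.

```python
# Frequencies from the soft drink frequency distribution.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # number of observations (50)

# Equation (2.1): relative frequency = frequency of the class / n.
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency multiplied by 100.
percent = {drink: 100 * count / n for drink, count in frequency.items()}

for drink in frequency:
    print(f"{drink:<12}{relative[drink]:>6.2f}{percent[drink]:>6.0f}")

# Combined share of the three most purchased soft drinks.
top_three = sum(sorted(percent.values(), reverse=True)[:3])
print(f"Top three: {top_three:.0f}%")
```

The relative frequencies sum to 1 and the percent frequencies sum to 100, which is a quick check that no observation was dropped or double counted.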


Using Excel to Construct a Frequency Distribution, a Relative
Frequency Distribution, and a Percent Frequency Distribution


We can use Excel’s Recommended PivotTables tool to construct a frequency distribution for the 
sample of 50 soft drink purchases. Two tasks are involved: Enter/Access Data and Apply Tools. 


Enter/Access Data: Open the file SoftDrink. The data are in cells A2:A51 and a label is in cell A1.


Apply Tools: The following steps describe how to use Excel’s Recommended PivotTables
tool to construct a frequency distribution for the sample of 50 soft drink purchases.


Step 1. Select any cell in the data set (cells A1:A51) 

Step 2. Click the Insert tab on the Ribbon 

Step 3. In the Tables group click Recommended PivotTables; a preview showing the 
frequency distribution appears 

Step 4. Click OK; the frequency distribution will appear in a new worksheet 


The worksheet in Figure 2.1 shows the frequency distribution for the 50 soft drink purchases created
using these steps. Also shown is the PivotTable Fields task pane, a key component of every
PivotTable report. We will discuss the use of the PivotTable Fields task pane later in the chapter. 


Editing Options: You can easily change the column headings in the frequency distribution
output. For instance, to change the current heading in cell A3 (Row Labels) to “Soft
Drink,” click cell A3 and type Soft Drink; to change the current heading in cell B3 (Count 
of Brand Purchased) to “Frequency,” click cell B3 and type Frequency; and to change the 
current heading in A9 (Grand Total) to “Total,” click cell A9 and type Total. Figure 2.2 
contains the revised headings; in addition, the headings “Relative Frequency” and “Percent 
Frequency” were entered into cells C3 and D3. We will now show how to construct the 
relative frequency and percent frequency distributions. 


Enter Functions and Formulas: Refer to Figure 2.2 as we describe how to create the relative
and percent frequency distributions for the soft drink purchases. The formula worksheet
is in the background and the value worksheet in the foreground. To compute the relative
frequency for Coca-Cola using equation (2.1), we entered the formula =B4/$B$9 in cell
C4; the result, 0.38, is the relative frequency for Coca-Cola. The absolute reference $B$9
keeps the denominator fixed on the total when the formula is copied. Copying cell C4 to cells C5:C8
computes the relative frequencies for each of the other soft drinks. To compute the percent
frequency for Coca-Cola, we entered the formula =C4*100 in cell D4. The result, 38, indicates
that 38% of the soft drink purchases were Coca-Cola. Copying cell D4 to cells D5:D8
computes the percent frequencies for each of the other soft drinks. To compute the total of
the relative and percent frequencies we used Excel’s SUM function in cells C9 and D9.


FIGURE 2.1 Frequency Distribution of Soft Drink Purchases Constructed
Using Excel’s Recommended PivotTables Tool

      A              B
3     Row Labels     Count of Brand Purchased
4     Coca-Cola      19
5     Diet Coke      8
6     Dr. Pepper     5
7     Pepsi          13
8     Sprite         5
9     Grand Total    50

[PivotTable Fields task pane: Brand Purchased is in the ROWS area and Count of Brand
Purchased is in the VALUES area.]

FIGURE 2.2 Relative Frequency and Percent Frequency Distributions of
Soft Drink Purchases Constructed Using Excel’s Recommended
PivotTables Tool

Formula worksheet:

      A             B            C                     D
3     Soft Drink    Frequency    Relative Frequency    Percent Frequency
4     Coca-Cola     19           =B4/$B$9              =C4*100
5     Diet Coke     8            =B5/$B$9              =C5*100
6     Dr. Pepper    5            =B6/$B$9              =C6*100
7     Pepsi         13           =B7/$B$9              =C7*100
8     Sprite        5            =B8/$B$9              =C8*100
9     Total         50           =SUM(C4:C8)           =SUM(D4:D8)

Value worksheet:

      A             B            C                     D
3     Soft Drink    Frequency    Relative Frequency    Percent Frequency
4     Coca-Cola     19           0.38                  38
5     Diet Coke     8            0.16                  16
6     Dr. Pepper    5            0.1                   10
7     Pepsi         13           0.26                  26
8     Sprite        5            0.1                   10
9     Total         50           1                     100


Bar Charts and Pie Charts 


A bar chart is a graphical display for depicting categorical data summarized in a 
frequency, relative frequency, or percent frequency distribution. On one axis of the graph 
we specify the labels that are used for the classes (categories). A frequency, relative 
frequency, or percent frequency scale can be used for the other axis of the chart. Then, 
using a bar of fixed width drawn above or next to each class label, we extend the length 
of the bar until we reach the frequency, relative frequency, or percent frequency of the 
class. For categorical data, the bars should be separated to emphasize the fact that each 
class is separate. Figure 2.3 shows a bar chart of the frequency distribution for the 50 soft
drink purchases. Note how the graphical presentation shows Coca-Cola, Pepsi, and Diet
Coke to be the most preferred brands. We can make the brand preferences even more obvious
by creating a sorted bar chart as in Figure 2.4. Here, we sort the drink categories:
highest frequency on the left and lowest frequency on the right.


FIGURE 2.3 Bar Chart of Soft Drink Purchases

FIGURE 2.4 Sorted Bar Chart of Soft Drink Purchases
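
The sorting idea behind a sorted bar chart can be illustrated with a small Python sketch that prints a text version of the chart. This is an assumption-laden illustration rather than the Excel procedure: the names `frequency` and `ordered` are hypothetical, and the bars are printed as rows from top to bottom (highest frequency first) instead of as columns from left to right.

```python
# Frequencies from the soft drink frequency distribution.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

# Sort categories from highest to lowest frequency.
# Python's sort is stable, so tied categories keep their original order.
ordered = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

# Draw one bar per category; the bar length equals the frequency.
for drink, count in ordered:
    print(f"{drink:<12}{'#' * count} {count}")
```

Reading down the output, the longest bar comes first, so the ranking of the categories is visible without comparing the labels one by one.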

In Figures 2.3 and 2.4 the horizontal axis was used to specify the labels for the 
categories; thus, the bars of the chart appear vertically in the display. In Excel, this type 
of display is referred to as a column chart. We could also display the bars for the chart 
horizontally by using the vertical axis to display the labels; Excel refers to this type of 
display as a bar chart. The choice of whether to display the bars vertically or horizontally
depends upon what you want the final chart to look like. Throughout the text we will
refer to either type of display as a bar chart. 

The pie chart provides another graphical display for presenting relative frequency 
and percent frequency distributions for categorical data. To construct a pie chart, we 
first draw a circle to represent all the data. Then we use the relative frequencies to 
subdivide the circle into sectors, or parts, that correspond to the relative frequency for 
each class. For example, because a circle contains 360 degrees and Coca-Cola shows 
a relative frequency of .38, the sector of the pie chart labeled Coca-Cola consists of 
.38(360) = 136.8 degrees. The sector of the pie chart labeled Diet Coke consists of 
.16(360) = 57.6 degrees. Similar calculations for the other classes yield the pie chart 
in Figure 2.5. The numerical values shown for each sector can be frequencies, relative 
frequencies, or percent frequencies. Although pie charts are common ways of visualizing
data, many data visualization experts do not recommend their use because people
have difficulty perceiving differences in area. In most cases, a bar chart is superior to a 
pie chart for displaying categorical data. 
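
The sector sizes for a pie chart follow directly from the relative frequencies, and the arithmetic can be sketched in Python. The snippet is a minimal illustration under stated assumptions: the names `frequency` and `degrees` are introduced here, and each angle is computed as 360 times the class frequency divided by n, which is the same as 360 times the relative frequency.

```python
# Frequencies from the soft drink frequency distribution.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # number of observations (50)

# Each sector gets a share of the 360 degrees equal to its relative frequency.
degrees = {drink: 360 * count / n for drink, count in frequency.items()}

for drink, angle in degrees.items():
    print(f"{drink:<12}{angle:>7.1f} degrees")
```

The angles for all classes add up to 360 degrees, so the sectors exactly fill the circle.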

Numerous options involving the use of colors, shading, legends, text font, and three-dimensional
perspectives are available to enhance the visual appearance of bar and pie
charts. However, one must be careful not to overuse these options because they may not
enhance the usefulness of the chart. For instance, consider the three-dimensional pie
chart for the soft drink data shown in Figure 2.6. Compare it to the simpler presentation
shown in Figures 2.3-2.5. The three-dimensional perspective adds no new understanding.
The use of a legend in Figure 2.6 also forces your eyes to shift back and forth
between the key and the chart. Most readers find the sorted bar chart in Figure 2.4 much 
easier to interpret because it is obvious which soft drinks have the highest frequencies. 

In general, pie charts are not the best way to present percentages for comparison. In 
Section 2.5 we provide additional guidelines for creating effective visual displays. 

FIGURE 2.5 Pie Chart of Soft Drink Purchases


FIGURE 2.6 Three-Dimensional Pie Chart of Soft Drink Purchases

Legend: Coca-Cola, Pepsi, Diet Coke, Dr. Pepper, Sprite


Using Excel to Construct a Bar Chart 


We can use Excel’s Recommended Charts tool to construct a bar chart for the sample 
of 50 soft drink purchases. Two tasks are involved: Enter/Access Data and Apply 
Tools. 


Enter/Access Data: Open the file SoftDrink. The data are in cells A2:A51 and a label is in
cell A1.


Apply Tools: The following steps describe how to use Excel’s Recommended Charts tool to 
construct a bar chart for the sample of 50 soft drink purchases. 


Step 1. Select any cell in the data set (cells A1:A51) 

Step 2. Click the Insert tab on the Ribbon 

Step 3. In the Charts group click Recommended Charts; a preview showing the bar 
chart appears 

Step 4. Click OK; the bar chart will appear in a new worksheet 


The worksheet in Figure 2.7 shows the bar chart for the 50 soft drink purchases created
using these steps. Also shown are the frequency distribution and PivotTable Fields task
pane that were created by Excel in order to construct the bar chart. Thus, using Excel’s
Recommended Charts tool you can construct a bar chart and a frequency distribution at
the same time.

Note: Excel refers to the bar chart in Figure 2.7 as a Clustered Column chart.


FIGURE 2.7 Bar Chart of Soft Drink Purchases Constructed Using Excel’s
Recommended Charts Tool


Editing Options: You can easily edit the bar chart to display a different chart title and add
axis titles. For instance, suppose you would like to create a sorted bar chart that uses “Bar
Chart of Soft Drink Purchases” as the chart title, “Soft Drink” as the horizontal axis
title, and “Frequency” as the vertical axis title.


Step 1. Click the Chart Title and replace it with Bar Chart of Soft Drink Purchases 

Step 2. Click the Chart Elements button (located next to the top right corner of the
chart)

Step 3. When the list of chart elements appears:
Click Axis Titles (creates placeholders for the axis titles)

Step 4. Click the horizontal Axis Title placeholder and replace it with Soft Drink 

Step 5. Click the vertical Axis Title placeholder and replace it with Frequency 

Step 6. Right-click on any of the bars in the chart, select Sort and click Sort Largest 
to Smallest 


The edited bar chart is shown in Figure 2.8. 


FIGURE 2.8 Edited Sorted Bar Chart of Soft Drink Purchases Constructed
Using Excel’s Recommended Charts Tool


Methods
1. The response to a question has three alternatives: A, B, and C. A sample of 120 
responses provides 60 A, 24 B, and 36 C. Show the frequency and relative frequency 
distributions. 
2. A partial relative frequency distribution is given. 


Class    Relative Frequency
A        .22
B        .18
C        .40
D


a. What is the relative frequency of class D? 
b. The total sample size is 200. What is the frequency of class D? 
c. Show the frequency distribution. 
d. Show the percent frequency distribution. 
3. A questionnaire provides 58 Yes, 42 No, and 20 No-Opinion answers. 
a. In the construction of a pie chart, how many degrees would be in the section of the 
pie showing the Yes answers? 
b. How many degrees would be in the section of the pie showing the No answers? 
c. Construct a pie chart. 
d. Construct a bar chart. 


Applications 
4. Most Visited Websites. In a recent report, the top five most-visited English-language 
websites were google.com (GOOG), facebook.com (FB), youtube.com (YT), yahoo.com
(YAH), and wikipedia.com (WIKI). The most-visited websites for a sample of
50 Internet users are shown in the following table: 


DATA file: Websites

YAH     WIKI    YT      WIKI    GOOG
YT      YAH     GOOG    GOOG    GOOG
WIKI    GOOG    YAH     YAH     YAH
YAH     YT      GOOG    YT      YAH
GOOG    FB      FB      WIKI    GOOG
GOOG    GOOG    FB      FB      WIKI
FB      YAH     YT      YAH     YAH
YT      GOOG    YAH     FB      FB
WIKI    GOOG    YAH     WIKI    WIKI
YAH     YT      GOOG    GOOG    WIKI


a. Are these data categorical or quantitative? 
b. Provide frequency and percent frequency distributions. 
c. On the basis of the sample, which website is most frequently visited by
Internet users? Which is second?
5. Most Popular Last Names. In alphabetical order, the six most common last names in 
the United States in 2018 were Brown, Garcia, Johnson, Jones, Smith, and Williams 
(United States Census Bureau website). Assume that a sample of 50 individuals with
one of these last names provided the following data:

DATA file: Names2018

Brown       Williams    Williams    Williams    Brown
Smith       Jones       Smith       Johnson     Smith
Garcia      Smith       Brown       Williams    Johnson
Johnson     Smith       Smith       Johnson     Brown
Williams    Garcia      Johnson     Williams    Johnson
Williams    Johnson     Jones       Smith       Brown
Johnson     Smith       Smith       Brown       Jones
Jones       Jones       Smith       Smith       Garcia
Garcia      Jones       Williams    Garcia      Smith
Jones       Johnson     Brown       Johnson     Garcia


Summarize the data by constructing the following:

a. Relative and percent frequency distributions

b. A bar chart

c. A sorted bar chart

d. A pie chart

e. Based on these data, what are the three most common last names? Which type of
chart makes this most apparent?

6. Top-Rated Television Show Networks. Nielsen Media Research provided the list of 
the 25 top-rated single shows in television history. The following data show the television
network that produced each of these 25 top-rated shows.

DATA file: NetworkShows

CBS    CBS    NBC    FOX    CBS
CBS    NBC    NBC    NBC    ABC
ABC    NBC    ABC    ABC    NBC
CBS    NBC    CBS    ABC    NBC
NBC    CBS    CBS    ABC    CBS


a. Construct a frequency distribution, percent frequency distribution, and bar chart for 
the data. 

b. Which networks have done the best in terms of presenting top-rated television 
shows? Compare the performance of ABC, CBS, and NBC. 

7. Airline Customer Satisfaction Survey. Many airlines use surveys to collect data on 
customer satisfaction related to flight experiences. After completing a flight, customers 
receive an e-mail asking them to go to the website and rate a variety of factors, including
the reservation process, the check-in process, luggage policy, cleanliness of gate
area, service by flight attendants, food/beverage selection, on-time arrival, and so on. 
A five-point scale, with Excellent (E), Very Good (V), Good (G), Fair (F), and Poor 
(P), is used to record customer ratings. Assume that passengers on a Delta Airlines 
flight from Myrtle Beach, South Carolina to Atlanta, Georgia provided the following 
ratings for the question “Please rate the airline based on your overall experience with 
this flight.” The sample ratings are shown below. 


E E G V V 
E G V E E 
V V V F V 
G E V E V 
E E V V E P E V P 
a. Use a percent frequency distribution and a bar chart to summarize these data. What do
these summaries indicate about the overall customer satisfaction with the Delta flight?
b. The online survey questionnaire enabled respondents to explain any aspect of the flight
that failed to meet expectations. Would this be helpful information to a manager looking
for ways to improve the overall customer satisfaction on Delta flights? Explain.

DATA file: AirSurvey

8. Baseball Hall of Fame Positions. Data for a sample of 55 members of the Baseball
Hall of Fame in Cooperstown, New York are shown here. Each observation indicates
the primary position played by the Hall of Famers: pitcher (P), catcher (H), first base
(1), second base (2), third base (3), shortstop (S), left field (L), center field (C), and
right field (R).

L P C H 2 P R 1 S S 1 L P R P
P P P R C S L R P C C P P R P
2 3 P H L P 1 C P P P S 1 L R
R 1 2 H S 3 H 2 L P

DATA file: BaseballHall

a. Construct frequency and relative frequency distributions to summarize the data. 
b. What position provides the most Hall of Famers? 
c. What position provides the fewest Hall of Famers? 
d. What outfield position (L, C, or R) provides the most Hall of Famers? 
e. Compare infielders (1, 2, 3, and S) to outfielders (L, C, and R). 
9. Degrees Awarded Annually. Nearly 1.9 million bachelor’s degrees and over
758,000 master’s degrees are awarded annually by U.S. postsecondary institutions
as of 2018 (National Center for Education Statistics website). The Department of
Education tracks the field of study for these graduates in the following categories:
Business (B), Computer Sciences and Engineering (CSE), Education (E), Humanities
(H), Natural Sciences and Mathematics (NSM), Social and Behavioral Sciences
(SBS), and Other (O). Consider the following samples of 100 graduates:

Bachelor’s Degree Field of Study

SBS H H H E B O SBS NSM CSE
O B B O O H B O SBS O
H CSE CSE O CSE B H O O SBS
SBS SBS B H NSM B B O SBS SBS
B H SBS O B B O O B O
O H SBS H CSE CSE B E CSE SBS
SBS NSM NSM CSE H H E E SBS CSE
NSM NSM SBS O H H B SBS SBS NSM
H B B O O O NSM H E B
E B O B B B O O O O

Master’s Degree Field of Study

O O B O B E B H E B
O E SBS B CSE H B E E O
O B B O E CSE NSM O B E
H H B E SBS E E B O E
SBS B B CSE H B B CSE SBS B
CSE B E CSE B E CSE O E O
B O E O B NSM H E B E
B E B O E E H O O O
CSE O O H B O B E CSE O
E O SBS E E O SBS B B O


a. Provide a percent frequency distribution of field of study for each degree. 

b. Construct a bar chart for field of study for each degree. 

c. What is the lowest percentage field of study for each degree? 

d. What is the highest percentage field of study for each degree? 

e. Which field of study has the largest increase in percentage from bachelor’s
to master’s?

10. Online Hotel Ratings. TripAdvisor is one of many websites that provide
ratings for hotels throughout the world. Ratings provided by 649 guests at the
Lakeview Hotel can be found in the file HotelRatings. Possible responses were
Excellent, Very Good, Average, Poor, and Terrible.

DATA file: HotelRatings

a. Construct a frequency distribution.
b. Construct a percent frequency distribution.
c. Construct a bar chart for the percent frequency distribution.
d. Comment on how guests rate their stay at the Lakeview Hotel.
e. Suppose that results for 1679 guests who stayed at the Timber Hotel provided the
following frequency distribution.

Rating       Frequency
Excellent    807
Very Good    521
Average      200
Poor         107
Terrible     44

Compare the ratings for the Timber Hotel with the results obtained for the
Lakeview Hotel.


2.2 Summarizing Data for a Quantitative Variable 


Frequency Distribution 


As defined in Section 2.1, a frequency distribution is a tabular summary of data showing 
the number (frequency) of observations in each of several nonoverlapping categories or 
classes. This definition holds for quantitative as well as categorical data. However, with 
quantitative data we must be more careful in defining the nonoverlapping classes to be 
used in the frequency distribution. 

For example, consider the quantitative data in Table 2.4. These data show the time 
in days required to complete year-end audits for a sample of 20 clients of Sanderson and 
Clifford, a small public accounting firm. The three steps necessary to define the classes for 
a frequency distribution with quantitative data are as follows: 


1. Determine the number of nonoverlapping classes. 
2. Determine the width of each class. 
3. Determine the class limits. 


Let us demonstrate these steps by developing a frequency distribution for the audit time 
data in Table 2.4. 


Number of classes. Classes are formed by specifying ranges that will be used to group
the data. As a general guideline, we recommend using between 5 and 20 classes. For a 
small number of data items, as few as 5 or 6 classes may be used to summarize the data. 


TABLE 2.4 Year-End Audit Times (In Days)

12 14 19 18
15 15 18 17

DATA file: Audit


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


48 Chapter 2 Descriptive Statistics: Tabular and Graphical Displays 


For a larger number of data items, a larger number of classes is usually required. The 
goal is to use enough classes to show the variation in the data, but not so many classes 
that some contain only a few data items. Because the number of data items in Table 2.4 is 
relatively small (n = 20), we chose to develop a frequency distribution with five classes. 


Making the classes the same width reduces the chance of inappropriate
interpretations by the user.

Width of the classes. The second step in constructing a frequency distribution for
quantitative data is to choose a width for the classes. As a general guideline, we recommend
that the width be the same for each class. Thus the choices of the number of classes and
the width of classes are not independent decisions. A larger number of classes means a
smaller class width, and vice versa. To determine an approximate class width, we begin by
identifying the largest and smallest data values. Then, with the desired number of classes
specified, we can use the following expression to determine the approximate class width.


$$\text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}} \qquad (2.2)$$


The approximate class width given by equation (2.2) can be rounded to a more convenient
value based on the preference of the person developing the frequency distribution. For
example, an approximate class width of 9.28 might be rounded to 10 simply because 10 is
a more convenient class width to use in presenting a frequency distribution.

For the data involving the year-end audit times, the largest data value is 33 and the
smallest data value is 12. Because we decided to summarize the data with five classes,
using equation (2.2) provides an approximate class width of (33 - 12)/5 = 4.2. We
therefore decided to round up and use a class width of five days in the frequency
distribution.

In practice, the number of classes and the appropriate class width are determined by
trial and error. Once a possible number of classes is chosen, equation (2.2) is used to find
the approximate class width. The process can be repeated for a different number of classes.
Ultimately, the analyst uses judgment to determine the combination of the number of classes
and class width that provides the best frequency distribution for summarizing the data.

No single frequency distribution is best for a data set. Different people may construct
different, but equally acceptable, frequency distributions. The goal is to reveal the
natural grouping and variation in the data.


For the audit time data in Table 2.4, after deciding to use five classes, each with a width 
of five days, the next task is to specify the class limits for each of the classes. 
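
The class-width arithmetic can also be checked outside Excel. The following Python sketch is only an illustration: the function name `approximate_class_width` is hypothetical, and the only inputs are the values stated in the text (largest value 33, smallest value 12, five classes). The loop at the end mimics the trial-and-error process by trying several candidate numbers of classes.

```python
import math


def approximate_class_width(largest, smallest, number_of_classes):
    """Equation (2.2): (largest - smallest) / number of classes."""
    return (largest - smallest) / number_of_classes


# Values stated in the text for the audit time data.
approx = approximate_class_width(33, 12, 5)
width = math.ceil(approx)  # round up to a convenient whole number of days
print(f"approximate width = {approx:.1f}, rounded up = {width}")

# Trial and error: approximate widths for a few candidate numbers of classes.
for k in (4, 5, 6, 7):
    print(k, "classes ->", round(approximate_class_width(33, 12, k), 2))
```

Rounding up with `math.ceil` is one convenient choice; as the text notes, the analyst may round to any width that makes the distribution easier to present.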


Class limits. Class limits must be chosen so that each data item belongs to one and only
one class. The lower class limit identifies the smallest possible data value assigned to the 
class. The upper class limit identifies the largest possible data value assigned to the class. 
In developing frequency distributions for categorical data, we did not need to specify class 
limits because each data item naturally fell into a separate class. But with quantitative data, 
such as the audit times in Table 2.4, class limits are necessary to determine where each 
data value belongs. 

Using the audit time data in Table 2.4, we selected 10 days as the lower class limit and
14 days as the upper class limit for the first class. This class is denoted 10-14 in Table 2.5.
The smallest data value, 12, is included in the 10-14 class. We then selected 15 days as the
lower class limit and 19 days as the upper class limit of the next class. We continued
defining the lower and upper class limits to obtain a total of five classes: 10-14, 15-19, 20-24,
25-29, and 30-34. The largest data value, 33, is included in the 30-34 class. The difference
between the lower class limits of adjacent classes is the class width. Using the first two
lower class limits of 10 and 15, we see that the class width is 15 - 10 = 5.

With the number of classes, class width, and class limits determined, a frequency
distribution can be obtained by counting the number of data values belonging to each
class. For example, the data in Table 2.4 show that four values (12, 14, 14, and 13)
belong to the 10-14 class. Thus, the frequency for the 10-14 class is 4. Continuing
this counting process for the 15-19, 20-24, 25-29, and 30-34 classes provides the
frequency distribution in Table 2.5.

TABLE 2.5 Frequency Distribution for the Audit Time Data

Audit Time (days)    Frequency
10-14                 4
15-19                 8
20-24                 5
25-29                 2
30-34                 1
Total                20

Using this frequency distribution, we can observe the following:

1. The most frequently occurring audit times are in the class of 15-19 days. Eight of
the 20 audit times belong to this class.
2. Only one audit required 30 or more days.


Other conclusions are possible, depending on the interests of the person viewing the fre- 
quency distribution. The value of a frequency distribution is that it provides insights about 
the data that are not easily obtained by viewing the data in their original unorganized form. 
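
The counting process described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the text's Excel procedure: the 20 values in `audit_times` are stand-in values chosen to agree with the facts given in the text (n = 20, smallest value 12, largest value 33, class frequencies 4, 8, 5, 2, and 1) and should be replaced with the contents of the Audit file. The helper names `class_limits` and `frequency_distribution` are hypothetical.

```python
# Stand-in audit times (assumed values consistent with the text's summary facts).
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]


def class_limits(first_lower, width, number_of_classes):
    """Return (lower, upper) class limits for integer-valued data."""
    return [(first_lower + i * width, first_lower + (i + 1) * width - 1)
            for i in range(number_of_classes)]


def frequency_distribution(values, limits):
    """Count values per class; each value must belong to one and only one class."""
    counts = {limit: 0 for limit in limits}
    for v in values:
        matches = [(lo, hi) for lo, hi in limits if lo <= v <= hi]
        if len(matches) != 1:
            raise ValueError(f"{v} does not belong to exactly one class")
        counts[matches[0]] += 1
    return counts


limits = class_limits(first_lower=10, width=5, number_of_classes=5)
freq = frequency_distribution(audit_times, limits)
for (lo, hi), count in freq.items():
    print(f"{lo}-{hi}: {count}")
print("Total:", sum(freq.values()))
```

The explicit check that every value matches exactly one class mirrors the requirement that class limits be nonoverlapping and cover all of the data.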


Class midpoint. In some applications, we want to know the midpoints of the classes in
a frequency distribution for quantitative data. The class midpoint is the value halfway
between the lower and upper class limits. For the audit time data, the five class midpoints
are 12, 17, 22, 27, and 32.
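
As a quick check of the midpoint definition, the short Python sketch below averages the lower and upper limits of each class from Table 2.5. The variable names are illustrative only.

```python
# Class limits from Table 2.5.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]

# The midpoint is halfway between the lower and upper class limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
print(midpoints)
```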


Relative Frequency and Percent Frequency Distributions 


We define the relative frequency and percent frequency distributions for quantitative data 
in the same manner as for categorical data. First, recall that the relative frequency is the 
proportion of the observations belonging to a class. With n observations, 


$$\text{Relative frequency of class} = \frac{\text{Frequency of the class}}{n}$$


The percent frequency of a class is the relative frequency multiplied by 100. 

Based on the class frequencies in Table 2.5 and with n = 20, Table 2.6 shows the relative 
frequency distribution and percent frequency distribution for the audit time data. Note that .40 
of the audits, or 40%, required from 15 to 19 days. Only .05 of the audits, or 5%, required 30 
or more days. Again, additional interpretations and insights can be obtained by using Table 2.6. 


TABLE 2.6 Relative Frequency and Percent Frequency Distributions
for the Audit Time Data

Audit Time (days)    Relative Frequency    Percent Frequency
10-14                 .20                   20
15-19                 .40                   40
20-24                 .25                   25
25-29                 .10                   10
30-34                 .05                    5
Total                1.00                  100
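
The relative and percent frequencies in Table 2.6 follow directly from the class frequencies in Table 2.5. A minimal Python sketch, with illustrative variable names, is:

```python
# Class frequencies from Table 2.5.
frequencies = {"10-14": 4, "15-19": 8, "20-24": 5, "25-29": 2, "30-34": 1}

n = sum(frequencies.values())  # number of observations

# Relative frequency = class frequency / n; percent frequency = relative * 100.
relative = {label: f / n for label, f in frequencies.items()}
percent = {label: 100 * r for label, r in relative.items()}

for label in frequencies:
    print(f"{label}  {relative[label]:.2f}  {percent[label]:.0f}")
print(f"Total  {sum(relative.values()):.2f}  {sum(percent.values()):.0f}")
```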




Using Excel to Construct a Frequency Distribution 


We can use Excel’s PivotTable tool to construct a frequency distribution for the audit time 
data. Two tasks are involved: Enter/Access Data and Apply Tools. 


Enter/Access Data: Open the file Audit. The data are in cells A2:A21 and a label is in cell A1.


Apply Tools: The following steps describe how to use Excel’s PivotTable tool to construct 
a frequency distribution for the audit time data. When using Excel’s PivotTable tool, each 
column of data is referred to as a field. Thus, for the audit time example, the data appearing 
in cells A2:A21 and the label in cell A1 are referred to as the Audit Time field. 


Step 1. Select any cell in the data set (cells A1:A21)
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Tables group click PivotTable
Step 4. When the Create PivotTable dialog box appears:
        Click OK; a PivotTable and PivotTable Fields task pane will appear in a
        new worksheet
Step 5. In the PivotTable Fields task pane:
        Drag Audit Time to the ROWS area
        Drag Audit Time to the VALUES area
Step 6. Click on Sum of Audit Time in the VALUES area
Step 7. Select Value Field Settings... from the list of options that appears
Step 8. When the Value Field Settings dialog box appears:
        Under Summarize value field by, choose Count
        Click OK


Figure 2.9 shows the resulting PivotTable Fields task pane and the corresponding
PivotTable. To construct the frequency distribution shown in Table 2.5, we must group the rows
containing the audit times. The following steps accomplish this.

Step 1. Right-click cell A4 in the PivotTable or any other cell containing an audit time
Step 2. Choose Group from the list of options that appears
Step 3. When the Grouping dialog box appears:
        Enter 10 in the Starting at: box
        Enter 34 in the Ending at: box
        Enter 5 in the By: box
        Click OK

FIGURE 2.9 PivotTable Fields Task Pane and Initial PivotTable Used
to Construct a Frequency Distribution for the Audit Time Data

FIGURE 2.10 Frequency Distribution for the Audit Time Data Constructed
Using Excel’s PivotTable Tool

The same Excel procedures we followed in the previous section can now be used
to develop relative and percent frequency distributions if desired.


Figure 2.10 shows the completed PivotTable Fields task pane and the corresponding
PivotTable. We see that with the exception of the column headings, the PivotTable provides the
same information as the frequency distribution shown in Table 2.5.


Editing Options: You can easily change the labels in the PivotTable to match the labels 
in Table 2.5. For instance, to change the current heading in cell A3 (Row Labels) to 
“Audit Time (days),” click in cell A3 and type Audit Time (days); to change the current 
heading in cell B3 (Count of Audit Time) to “Frequency,” click in cell B3 and type 
Frequency; and to change the current heading in A9 (Grand Total) to Total, click in cell 
A9 and type Total. 


In smaller data sets or when there are a large number of classes, some classes may have 
no data values. In such cases, Excel’s PivotTable tool will remove these classes when 
constructing the frequency distribution. For presentation, however, we recommend editing 
the results to show all the classes, including those with no data values. The following steps 
show how this can be done. 


Step 1. Right-click on any cell in the Row Labels column of the PivotTable
Step 2. Click Field Settings...
Step 3. When the Field Settings dialog box appears:
        Click the Layout & Print tab
        Choose Show items with no data
        Click OK
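
For readers who want to see the grouping logic outside Excel, the following Python sketch imitates the effect of the Grouping dialog box (Starting at, Ending at, By) and of the Show items with no data option. It is an assumption-laden illustration rather than a description of how Excel works internally: the function `group_counts` and the small `sample` data set are hypothetical, and values outside the starting and ending limits are simply ignored here.

```python
from collections import Counter


def group_counts(values, start, end, by, show_empty=False):
    """Count integer values in groups of width `by` running from `start` to `end`.

    Empty groups are dropped unless show_empty is True, which imitates the
    effect of choosing "Show items with no data".
    """
    counts = Counter((v - start) // by for v in values if start <= v <= end)
    rows = []
    for i in range((end - start) // by + 1):
        lower = start + i * by
        upper = min(lower + by - 1, end)
        if counts[i] or show_empty:
            rows.append((f"{lower}-{upper}", counts[i]))
    return rows


# Hypothetical small data set in which the 20-24 and 25-29 classes are empty.
sample = [12, 14, 19, 18, 15, 33]

print(group_counts(sample, start=10, end=34, by=5))
print(group_counts(sample, start=10, end=34, by=5, show_empty=True))
```

The first call omits the two empty classes, while the second call lists all five classes with zero counts where appropriate, which is the presentation recommended above.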


Dot Plot 


One of the simplest graphical summaries of data is a dot plot. A horizontal axis shows
the range for the data. Each data value is represented by a dot placed above the axis.
Figure 2.11 is the dot plot for the audit time data in Table 2.4. The three dots located
above 18 on the horizontal axis indicate that an audit time of 18 days occurred three
times. Dot plots show the details of the data and are useful for comparing the
distribution of the data for two or more variables.

FIGURE 2.11 Dot Plot for the Audit Time Data
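
A dot plot can be approximated in plain text by stacking one symbol per observation above each value on the axis. The Python sketch below is illustrative only: `dot_plot` is a hypothetical helper, the layout assumes two-digit integer values, and `audit_times` holds stand-in values chosen to agree with the facts stated in the text (for example, 18 days occurring three times) rather than a verified copy of the Audit file.

```python
from collections import Counter

# Stand-in audit times (assumed values consistent with the text's summary facts).
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]


def dot_plot(values):
    """Return a text dot plot: one '*' per observation, stacked above its value."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build the plot from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" * " if counts[v] >= level else "   " for v in axis)
        rows.append(row.rstrip())
    # Axis labels, each centered in a three-character column.
    rows.append("".join(f"{v:^3}" for v in axis).rstrip())
    return "\n".join(rows)


plot = dot_plot(audit_times)
print(plot)
```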


Histogram 


A common graphical display of quantitative data is a histogram. This graphical display 
can be prepared for data previously summarized in either a frequency, relative frequency, or 
percent frequency distribution. A histogram is constructed by placing the variable of interest 
on the horizontal axis and the frequency, relative frequency, or percent frequency on the 
vertical axis. The frequency, relative frequency, or percent frequency of each class is shown 
by drawing a rectangle whose base is determined by the class limits on the horizontal axis 
and whose height is the corresponding frequency, relative frequency, or percent frequency. 
Figure 2.12 is a histogram for the audit time data. Note that the class with the greatest
frequency is shown by the rectangle appearing above the class of 15-19 days. The height of
the rectangle shows that the frequency of this class is 8. A histogram for the relative or percent
frequency distribution of these data would look the same as the histogram in Figure 2.12 with
the exception that the vertical axis would be labeled with relative or percent frequency values.

As Figure 2.12 shows, the adjacent rectangles of a histogram touch one another.
Unlike a bar chart, a histogram contains no natural separation between the rectangles
of adjacent classes. This format is the usual convention for histograms. Because the
classes for the audit time data are stated as 10-14, 15-19, 20-24, 25-29, and 30-34,
one-unit spaces of 14 to 15, 19 to 20, 24 to 25, and 29 to 30 would seem to be needed
between the classes. These spaces are eliminated when constructing a histogram.
Eliminating the spaces between classes in a histogram for the audit time data helps
show that all values between the lower limit of the first class and the upper limit of the
last class are possible.

FIGURE 2.12 Histogram for the Audit Time Data (vertical axis: Frequency; horizontal
axis: Audit Time (days), with classes 10-14, 15-19, 20-24, 25-29, and 30-34)

FIGURE 2.13 Histograms Showing Differing Levels of Skewness (Panel A: Moderately
Skewed Left; Panel B: Moderately Skewed Right; Panel C: Symmetric; Panel D: Highly
Skewed Right)
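
When a charting tool is not at hand, a rough text rendering of the histogram can be produced from the class frequencies in Table 2.5. The Python sketch below is a simplification: it draws the bars horizontally, so it corresponds to Figure 2.12 turned on its side, and the helper name `text_histogram` is hypothetical.

```python
# Class frequencies from Table 2.5.
frequencies = {"10-14": 4, "15-19": 8, "20-24": 5, "25-29": 2, "30-34": 1}


def text_histogram(freqs, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    width = max(len(label) for label in freqs)
    return [f"{label:>{width}} | {symbol * count} {count}"
            for label, count in freqs.items()]


lines = text_histogram(frequencies)
print("\n".join(lines))

# The class with the greatest frequency has the longest bar.
tallest = max(frequencies, key=frequencies.get)
print("Class with the greatest frequency:", tallest)
```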

One of the most important uses of a histogram is to provide information about the 
shape, or form, of a distribution. Figure 2.13 contains four histograms constructed from 
relative frequency distributions. Panel A shows the histogram for a set of data moder- 
ately skewed to the left. A histogram is said to be skewed to the left if its tail extends 
farther to the left. This histogram is typical for exam scores, with no scores above 
100%, most of the scores above 70%, and only a few really low scores. Panel B shows 
the histogram for a set of data moderately skewed to the right. A histogram is said to 
be skewed to the right if its tail extends farther to the right. An example of this type of 
histogram would be for data such as housing prices; a few expensive houses create the 
skewness in the right tail. 

Panel C shows a symmetric histogram. In a symmetric histogram, the left tail mirrors 
the shape of the right tail. Histograms for data found in applications are never perfectly 
symmetric, but the histogram for many applications may be roughly symmetric. Data for 
SAT scores, heights and weights of people, and so on lead to histograms that are roughly 
symmetric. Panel D shows a histogram highly skewed to the right. This histogram was con- 
structed from data on the dollar amount of customer purchases over one day at a women’s 
apparel store. Data from applications in business and economics often lead to histograms 
that are skewed to the right. For instance, data on housing prices, salaries, purchase amounts, 
and so on often result in histograms skewed to the right.


Using Excel’s Recommended Charts Tool to Construct a Histogram


In Figure 2.10 we showed the results of using Excel’s PivotTable tool to construct a frequency 
distribution for the audit time data. We will use these results to illustrate how Excel’s Recom- 
mended Charts tool can be used to construct a histogram for depicting quantitative data sum- 
marized in a frequency distribution. Refer to Figure 2.14 as we describe the steps involved. 


Apply Tools: The following steps describe how to use Excel’s Recommended Charts tool to 
construct a histogram for the audit time data. 


Step 1. Select any cell in the PivotTable report (cells A3:B9)
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group click Recommended Charts; a preview showing the
        recommended chart appears
Step 4. Click OK


The worksheet in Figure 2.14 shows the chart for the audit time data created using these 
steps. With the exception of the gaps separating the bars, this resembles the histogram for 
the audit time data shown in Figure 2.12. We can easily edit this chart to remove the gaps 
between the bars and enter more descriptive axis labels and a chart heading. 


Editing Options: In addition to removing the gaps between the bars, suppose you would
like to use “Histogram for Audit Time Data” as the chart title and insert “Audit Time (days)”
for the horizontal axis title and “Frequency” for the vertical axis title.


Step 1. Right-click any bar in the chart and choose Format Data Series... from the list
        of options that appears
Step 2. When the Format Data Series task pane appears:
        Click Series Options to display the options
        Set the Gap Width to 0
        Click the Close button X at the top right of the task pane
Step 3. Click the Chart Title and replace it with Histogram for Audit Time Data
Step 4. Click the Chart Elements button (located next to the top right corner of the
        chart)
Step 5. When the list of chart elements appears:
        Select the check box for Axis Titles (creates placeholders for the axis titles)
        Deselect the check box for Legend
Step 6. Click the horizontal Axis Title placeholder and replace it with Audit Time (days)
Step 7. Click the vertical Axis Title placeholder and replace it with Frequency

FIGURE 2.14 Initial Chart Used to Construct a Histogram for the Audit Time Data

FIGURE 2.15 Histogram for the Audit Time Data Created Using Excel’s
Recommended Charts Tool


The edited histogram for the audit time data is shown in Figure 2.15. 


In smaller data sets or when there are a large number of classes, some classes may have no 
data values. In such cases, Excel’s PivotTable tool will automatically remove these classes 
when constructing the frequency distribution, and the histogram created using Excel’s
Recommended Charts tool will not show the classes with no data values. For presentation,
however, we recommend editing the results to show all the classes, including those with no 
data values. The following steps show how this can be done by editing the PivotTable tool 
output before constructing the histogram. 


Step 1. Right-click on any cell in the Row Labels column of the PivotTable
Step 2. Click Field Settings...
Step 3. When the Field Settings dialog box appears:
        Click the Layout & Print tab
        Choose Show items with no data
        Click OK


After performing steps 1-3, the PivotTable will display all classes, including those with no 
data values. Using the Recommended Charts tool will then create a histogram that displays 
all classes. 


Cumulative Distributions 


A variation of the frequency distribution that provides another tabular summary of
quantitative data is the cumulative frequency distribution. The cumulative frequency
distribution uses the number of classes, class widths, and class limits developed for the frequency
distribution. However, rather than showing the frequency of each class, the cumulative
frequency distribution shows the number of data items with values less than or equal to the
upper class limit of each class. The first two columns of Table 2.7 provide the cumulative
frequency distribution for the audit time data.

TABLE 2.7 Cumulative Frequency, Cumulative Relative Frequency, and
Cumulative Percent Frequency Distributions for the Audit Time Data

                            Cumulative    Cumulative            Cumulative
Audit Time (days)           Frequency     Relative Frequency    Percent Frequency
Less than or equal to 14     4             .20                   20
Less than or equal to 19    12             .60                   60
Less than or equal to 24    17             .85                   85
Less than or equal to 29    19             .95                   95
Less than or equal to 34    20            1.00                  100

To understand how the cumulative frequencies are determined, consider the class
with the description “less than or equal to 24.” The cumulative frequency for this class is
simply the sum of the frequencies for all classes with data values less than or equal to 24.
For the frequency distribution in Table 2.5, the sum of the frequencies for classes 10-14,
15-19, and 20-24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24.
Hence, the cumulative frequency for this class is 17. In addition, the cumulative frequency
distribution in Table 2.7 shows that four audits were completed in 14 days or less and 19
audits were completed in 29 days or less.

As a final point, we note that a cumulative relative frequency distribution shows
the proportion of data items, and a cumulative percent frequency distribution shows
the percentage of data items with values less than or equal to the upper limit of each
class. The cumulative relative frequency distribution can be computed either by
summing the relative frequencies in the relative frequency distribution or by dividing the
cumulative frequencies by the total number of items. Using the latter approach, we found
the cumulative relative frequencies in column 3 of Table 2.7 by dividing the cumulative
frequencies in column 2 by the total number of items (n = 20). The cumulative percent
frequencies were again computed by multiplying the cumulative relative frequencies by 100.
The cumulative relative and percent frequency distributions show that .85 of the audits, or
85%, were completed in 24 days or less, .95 of the audits, or 95%, were completed in
29 days or less, and so on.
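
The running totals in Table 2.7 can be reproduced with a short Python sketch. It starts from the class frequencies in Table 2.5 and follows the second approach described above, dividing each cumulative frequency by n. The variable names are illustrative only.

```python
from itertools import accumulate

# Upper class limits and class frequencies from Table 2.5.
upper_limits = [14, 19, 24, 29, 34]
frequencies = [4, 8, 5, 2, 1]
n = sum(frequencies)

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Divide by n for cumulative relative frequency, then multiply by 100 for percent.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * r for r in cumulative_relative]

for limit, c, r, p in zip(upper_limits, cumulative,
                          cumulative_relative, cumulative_percent):
    print(f"Less than or equal to {limit}: {c:>2}  {r:.2f}  {p:.0f}")
```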


Stem-and-Leaf Display 


A stem-and-leaf display is a graphical display used to show simultaneously the rank order 
and shape of a distribution of data. To illustrate the use of a stem-and-leaf display, con- 
sider the data in Table 2.8. These data result from a 150-question aptitude test given to 50 
individuals recently interviewed for a position at Haskens Technology. The data indicate the 
number of questions answered correctly. 


TABLE 2.8 Number of Questions Answered Correctly on an Aptitude Test

112  72  69  97 107
 73  92  76  86  73
126 128 118 127 124
 82 104 132 134  83
 92 108  96 100  92
115  76  91 102  81
 95 141  81  80 106
 84 119 113  98  75
 68  98 115 106  95
100  85  94 106 119

DATA file: AptitudeTest

To develop a stem-and-leaf display, we first arrange the leading digits of each data value
to the left of a vertical line. To the right of the vertical line, we record the last digit for each
data value. Based on the top row of data in Table 2.8 (112, 72, 69, 97, and 107), the first
five entries in constructing a stem-and-leaf display would be as follows:

 6 | 9
 7 | 2
 8 |
 9 | 7
10 | 7
11 | 2
12 |
13 |
14 |

For example, the data value 112 shows the leading digits 11 to the left of the line and the
last digit 2 to the right of the line. Similarly, the data value 72 shows the leading digit 7 to
the left of the line and last digit 2 to the right of the line. Continuing to place the last digit
of each data value on the line corresponding to its leading digit(s) provides the following:

 6 | 9 8
 7 | 2 3 6 3 6 5
 8 | 6 2 3 1 1 0 4 5
 9 | 7 2 2 6 2 1 5 8 8 5 4
10 | 7 4 8 0 2 6 6 0 6
11 | 2 8 5 9 3 5 9
12 | 6 8 7 4
13 | 2 4
14 | 1


With this organization of the data, sorting the digits on each line into rank order is 
simple. Doing so provides the stem-and-leaf display shown here. 


 6 | 8 9
 7 | 2 3 3 5 6 6
 8 | 0 1 1 2 3 4 5 6
 9 | 1 2 2 2 4 5 5 6 7 8 8
10 | 0 0 2 4 6 6 6 7 8
11 | 2 3 5 5 8 9 9
12 | 4 6 7 8
13 | 2 4
14 | 1

The numbers to the left of the vertical line (6, 7, 8, 9, 10, 11, 12, 13, and 14) form the stem, 
and each digit to the right of the vertical line is a leaf. For example, consider the first row 
with a stem value of 6 and leaves of 8 and 9. 


 6 | 8 9

This row indicates that two data values have a first digit of six. The leaves show that the 
data values are 68 and 69. Similarly, the second row 

 7 | 2 3 3 5 6 6


indicates that six data values have a first digit of seven. The leaves show that the data val- 
ues are 72, 73, 73, 75, 76, and 76. 
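
The construction just described can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the text's Excel workflow; the function name `stem_and_leaf` is a hypothetical helper, and the scores are the Table 2.8 values read row by row.

```python
from collections import defaultdict

# Table 2.8 scores, read row by row.
scores = [
    112, 72, 69, 97, 107, 73, 92, 76, 86, 73,
    126, 128, 118, 127, 124, 82, 104, 132, 134, 83,
    92, 108, 96, 100, 92, 115, 76, 91, 102, 81,
    95, 141, 81, 80, 106, 84, 119, 113, 98, 75,
    68, 98, 115, 106, 95, 100, 85, 94, 106, 119,
]


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # e.g. 112 -> stem 11, leaf 2
        rows[stem].append(leaf)
    # Keep every stem between the smallest and largest, even an empty one.
    return {stem: sorted(rows[stem]) for stem in range(min(rows), max(rows) + 1)}


display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting the leaves inside the function combines the two manual steps (recording the leaves, then putting each row in rank order) into one.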

To focus on the shape indicated by the stem-and-leaf display, let us use a rectangle to 
contain the leaves of each stem. Doing so, we obtain the following: 


 6 | 8 9 |
 7 | 2 3 3 5 6 6 |
 8 | 0 1 1 2 3 4 5 6 |
 9 | 1 2 2 2 4 5 5 6 7 8 8 |
10 | 0 0 2 4 6 6 6 7 8 |
11 | 2 3 5 5 8 9 9 |
12 | 4 6 7 8 |
13 | 2 4 |
14 | 1 |


Rotating this page counterclockwise onto its side provides a picture of the data that is simi- 
lar to a histogram with classes of 60-69, 70-79, 80-89, and so on. 

Although the stem-and-leaf display may appear to offer the same information as a histo- 
gram, it has two primary advantages. 


1. The stem-and-leaf display is easier to construct by hand. 
2. Within a class interval, the stem-and-leaf display provides more information than 
the histogram because the stem-and-leaf shows the actual data. 


Just as a frequency distribution or histogram has no absolute number of classes, neither 
does a stem-and-leaf display have an absolute number of rows or stems. If we believe that 
our original stem-and-leaf display condensed the data too much, we can easily stretch the 
display by using two or more stems for each leading digit. For example, to use two stems 
for each leading digit, we would place all data values ending in 0, 1, 2, 3, and 4 in one row 
and all values ending in 5, 6, 7, 8, and 9 in a second row. The following stretched stem- 
and-leaf display illustrates this approach. 


 6 | 8 9
 7 | 2 3 3
 7 | 5 6 6
 8 | 0 1 1 2 3 4
 8 | 5 6
 9 | 1 2 2 2 4
 9 | 5 5 6 7 8 8
10 | 0 0 2 4
10 | 6 6 6 7 8
11 | 2 3
11 | 5 5 8 9 9
12 | 4
12 | 6 7 8
13 | 2 4
13 |
14 | 1

In a stretched stem-and-leaf display, whenever a stem value is stated twice, the first 
value corresponds to leaf values of 0-4, and the second value corresponds to leaf values 
of 5-9. 

A single digit is used to define each leaf in a stem-and-leaf display. The leaf unit 
indicates how to multiply the stem-and-leaf numbers in order to approximate the original 
data. Leaf units may be 100, 10, 1, 0.1, and so on. 




Note that values 72, 73, and 73 have leaves in the 0-4 range and are shown with the first 
stem value of 7. The values 75, 76, and 76 have leaves in the 5-9 range and are shown with 
the second stem value of 7. This stretched stem-and-leaf display is similar to a frequency 
distribution with intervals of 65-69, 70-74, 75-79, and so on. 
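
A stretched display can be sketched in Python along the same lines. This is a minimal sketch under the two-stems-per-leading-digit rule described above; `stretched_stem_and_leaf` is a hypothetical helper name, and the scores are again the Table 2.8 values.

```python
# Table 2.8 scores, read row by row.
scores = [
    112, 72, 69, 97, 107, 73, 92, 76, 86, 73,
    126, 128, 118, 127, 124, 82, 104, 132, 134, 83,
    92, 108, 96, 100, 92, 115, 76, 91, 102, 81,
    95, 141, 81, 80, 106, 84, 119, 113, 98, 75,
    68, 98, 115, 106, 95, 100, 85, 94, 106, 119,
]


def stretched_stem_and_leaf(values):
    """Return (stem, leaves) rows, two per stem: leaves 0-4 first, then 5-9."""
    low_stem, high_stem = min(values) // 10, max(values) // 10
    rows = []
    for stem in range(low_stem, high_stem + 1):
        leaves = sorted(v % 10 for v in values if v // 10 == stem)
        rows.append((stem, [leaf for leaf in leaves if leaf <= 4]))
        rows.append((stem, [leaf for leaf in leaves if leaf >= 5]))
    # Drop empty rows that fall outside the range of the data.
    while rows and not rows[0][1]:
        rows.pop(0)
    while rows and not rows[-1][1]:
        rows.pop()
    return rows


stretched = stretched_stem_and_leaf(scores)
for stem, leaves in stretched:
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Empty rows inside the range of the data, such as the second row for stem 13, are kept so that the shape of the distribution is not distorted.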

The preceding example showed a stem-and-leaf display for data with as many as 
three digits. Stem-and-leaf displays for data with more than three digits are possible. For 
example, consider the following data on the number of hamburgers sold by a fast-food 
restaurant for each of 15 weeks. 


1565  1852  1644  1766  1888  1912  2044  1812
1790  1679  2008  1852  1967  1954  1733

A stem-and-leaf display of these data follows. 

Leaf unit = 10
15 | 6
16 | 4 7
17 | 3 6 9
18 | 1 5 5 8
19 | 1 5 6
20 | 0 4


Note that a single digit is used to define each leaf and that only the first three digits of each 
data value have been used to construct the display. At the top of the display we have speci- 
fied Leaf unit = 10. To illustrate how to interpret the values in the display, consider the first 
stem, 15, and its associated leaf, 6. Combining these numbers, we obtain 156. To reconstruct 
an approximation of the original data value, we must multiply this number by 10, the value 
of the leaf unit. Thus, 156 × 10 = 1,560 is an approximation of the original data value used 
to construct the stem-and-leaf display. Although it is not possible to reconstruct the exact 
data value from this stem-and-leaf display, the convention of using a single digit for each leaf 
enables stem-and-leaf displays to be constructed for data having a large number of digits. For 
stem-and-leaf displays where the leaf unit is not shown, the leaf unit is assumed to equal 1. 
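
The role of the leaf unit can be illustrated with a short Python sketch. The helper name `stem_and_leaf_with_unit` is hypothetical; the sketch assumes, as the display above does, that digits beyond the leaf are dropped (truncated) rather than rounded.

```python
# Weekly hamburger sales from the example above.
sales = [1565, 1852, 1644, 1766, 1888, 1912, 2044, 1812,
         1790, 1679, 2008, 1852, 1967, 1954, 1733]


def stem_and_leaf_with_unit(values, leaf_unit):
    """Truncate each value to whole leaf units, then split off the last digit."""
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value // leaf_unit, 10)  # 1565 -> 156 -> (15, 6)
        rows.setdefault(stem, []).append(leaf)
    return rows


rows = stem_and_leaf_with_unit(sales, leaf_unit=10)
print("Leaf unit = 10")
for stem, leaves in rows.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")

# Approximate the first data value: stem 15 and leaf 6 give 156, times the leaf unit.
approx_first = (15 * 10 + rows[15][0]) * 10
print(approx_first)
```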


NOTES + COMMENTS 

1. A bar chart and a histogram are essentially the same thing; both are graphical 
   presentations of the data in a frequency distribution. A histogram is just a bar chart 
   with no separation between bars. For some discrete quantitative data, a separation 
   between bars is also appropriate. Consider, for example, the number of classes in which 
   a college student is enrolled. The data may only assume integer values. Intermediate 
   values such as 1.5 and 2.73 are not possible. With continuous quantitative data, 
   however, such as the audit times in Table 2.4, a separation between bars is not 
   appropriate. 

2. The appropriate values for the class limits with quantitative data depend on the level 
   of accuracy of the data. For instance, with the audit time data of Table 2.4 the limits 
   used were integer values. If the data were rounded to the nearest tenth of a day (e.g., 
   12.3 and 14.4), then the limits would be stated in tenths of days. For instance, the 
   first class would be 10.0-14.9. If the data were recorded to the nearest hundredth of a 
   day (e.g., 12.34 and 14.45), the limits would be stated in hundredths of days. For 
   instance, the first class would be 10.00-14.99. 

3. An open-end class requires only a lower class limit or an upper class limit. For 
   example, in the audit time data of Table 2.4, suppose two of the audits had taken 58 
   and 65 days. Rather than continue with the classes of width 5 with classes 35-39, 
   40-44, 45-49, and so on, we could simplify the frequency distribution to show an 
   open-end class of “35 or more.” This class would have a frequency of 2. Most often the 
   open-end class appears at the upper end of the distribution. Sometimes an open-end 
   class appears at the lower end of the distribution, and occasionally such classes 
   appear at both ends. 

4. The last entry in a cumulative frequency distribution always equals the total number of 
   observations. The last entry in a cumulative relative frequency distribution always 
   equals 1.00 and the last entry in a cumulative percent frequency distribution always 
   equals 100. 
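
The claim in the last note can be checked with a small Python sketch. The class frequencies below are hypothetical values chosen only for illustration; they are not taken from any table in this section.

```python
from itertools import accumulate

# Hypothetical class frequencies, used only to illustrate the note.
frequencies = [4, 8, 5, 2, 1]
total = sum(frequencies)

# Running totals give the cumulative frequency distribution.
cumulative = list(accumulate(frequencies))
# Dividing each running total by n gives the cumulative relative frequencies.
cumulative_relative = [c / total for c in cumulative]
cumulative_percent = [100 * c / total for c in cumulative]

print(cumulative)
print(cumulative_relative)
print(cumulative_percent)
```

Dividing each running total by the number of observations, rather than summing rounded relative frequencies, is what makes the last entry come out to exactly 1.00.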

EXERCISES 


Methods 
11. Consider the following data. 

DATA file: Frequency

14  21  23  21  16
19  22  25  16  16
24  24  25  19  16
19  18  19  21  12
16  17  18  23  25
20  23  16  20  19
24  26  15  22  24
20  22  24  22  20


a. Develop a frequency distribution using classes of 12-14, 15-17, 18-20, 21-23, and 
24-26. 
b. Develop a relative frequency distribution and a percent frequency distribution using 
the classes in part (a). 
12. Consider the following frequency distribution. 


Class    Frequency
10-19       10
20-29       14
30-39       17
40-49        7
50-59        2


Construct a cumulative frequency distribution and a cumulative relative frequency 
distribution. 

13. Construct a histogram for the data in exercise 12. 

14. Consider the following data. 


 8.9  10.2  11.5   7.8  10.0  12.2  13.5  14.1  10.0  12.2
 6.8   9.5  11.5  11.2  14.9   7.5  10.0   6.0  15.8  11.5


a. Construct a dot plot. 
b. Construct a frequency distribution. 
c. Construct a percent frequency distribution. 
15. Construct a stem-and-leaf display for the following data. 


11.3   9.6  10.4   7.5   8.3  10.5  10.0
 9.3   8.1   7.7   7.5   8.4   6.3   8.8
16. Construct a stem-and-leaf display for the following data. Use a leaf unit of 10. 
1161 1206 1478 1300 1604 1725 1361 1422 
1221 1378 1623 1426 1557 1730 1706 1689 
Applications 


17. Patient Waiting Times. A doctor’s office staff studied the waiting times for patients 
who arrive at the office with a request for emergency service. The following data with 
waiting times in minutes were collected over a one-month period. 


2 5 10 12 4 4 5 17 11 8 9 8 12 21 6 8 7 13 18 3 


Use classes of 0-4, 5-9, and so on in the following: 
a. Show the frequency distribution. 
b. Show the relative frequency distribution. 
c. Show the cumulative frequency distribution. 
d. Show the cumulative relative frequency distribution. 
e. What proportion of patients needing emergency service wait 9 minutes or less? 
18. NBA Total Player Ratings. CBSSports.com developed the Total Player Ratings system 
to rate players in the National Basketball Association (NBA) based upon various 
offensive and defensive statistics. The following data show the average number of 
points scored per game (PPG) for 50 players with the highest ratings for a portion of 
an NBA season (CBSSports.com website). 


DATA file: NBAPlayerPts

27.0  28.8  26.4  27.1  22.9  28.4  19.2  21.0  20.8  17.6
21.1  19.2  21.2  15.5  17.2  16.7  17.6  18.5  18.3  18.3
23.3  16.4  18.9  16.5  17.0  11.7  15.7  18.0  17.7  14.6
15.7  17.2  18.2  17.5  13.6  16.3  16.2  13.6  17.1  16.7
17.0  17.3  17.5  14.0  16.9  16.3  15.1  12.3  18.7  14.6


Use classes starting at 10 and ending at 30 in increments of 2 for PPG in the following. 
a. Show the frequency distribution. 
b. Show the relative frequency distribution. 
c. Show the cumulative percent frequency distribution. 
d. Develop a histogram for the average number of points scored per game. 
e. Do the data appear to be skewed? Explain. 
f. What percentage of the players averaged at least 20 points per game? 
19. Busiest North American Airports. Based on the total passenger traffic, the airports 
in the following list are the 20 busiest airports in North America in 2018 (The World 
Almanac). 

DATA file: Airports

Airport (Airport Code)               Total Passengers (Million)
Boston Logan (BOS)                         36.3
Charlotte Douglas (CLT)                    44.4
Chicago O'Hare (ORD)                       78
Dallas/Ft. Worth (DFW)                     65.7
Denver (DEN)                               58.3
Detroit Metropolitan (DTW)                 34.4
Hartsfield-Jackson Atlanta (ATL)          104.2
Houston George Bush (IAH)                  41.6
Las Vegas McCarran (LAS)                   47.5
Los Angeles (LAX)                          80.9
Miami (MIA)                                44.6
Minneapolis/St. Paul (MSP)                 37.4
New York John F. Kennedy (JFK)             59.1
Newark Liberty (EWR)                       40.6
Orlando (MCO)                              41.9
Philadelphia (PHL)                         36.4
Phoenix Sky Harbor (PHX)                   43.3
San Francisco (SFO)                        53.1
Seattle-Tacoma (SEA)                       45.7
Toronto Pearson (YYZ)                      44.3


a. Which is the busiest airport in terms of total passenger traffic? Which is the least 
busy airport in terms of total passenger traffic? 
b. Using a class width of 10, develop a frequency distribution of the data starting with 
30-39.9, 40-49.9, 50-59.9, and so on. 

c. Prepare a histogram. Interpret the histogram. 
20. CEO Time in Meetings. The London School of Economics and the Harvard 
Business School have conducted studies of how chief executive officers (CEOs) 
spend their time. These studies have found that CEOs spend many hours per week 
in meetings, not including conference calls, business meals, and public events 
(The Wall Street Journal). Shown below is the time spent per week in meetings 
(hours) for a sample of 25 CEOs. 


DATA file: CEOTime

14  15  18  23  15
19  20  13  15  23
23  21  15  20  21
16  15  18  18  19
19  22  23  21  12

a. What is the least amount of time spent per week on meetings? The highest? 
b. Use a class width of two hours to prepare a frequency distribution and a percent 
frequency distribution for the data. 
c. Prepare a histogram and comment on the shape of the distribution. 
21. Largest University Endowments. University endowments are financial assets that 
are donated by supporters to be used to provide income to universities. There is a 
large discrepancy in the size of university endowments. The following table provides a 
listing of many of the universities that have the largest endowments as reported by the 
National Association of College and University Business Officers in 2017. 

DATA file: Endowments

University                                     Endowment Amount ($ billions)
Amherst College                                       2.2
Boston College                                        2.3
Boston University                                     2.0
Brown University                                      3.2
California Institute of Technology                    2.6
Carnegie Mellon University                            2.2
Case Western Reserve University                       1.8
Columbia University                                  10.0
Cornell University                                    6.8
Dartmouth College                                     5.0
Duke University                                       7.9
Emory University                                      6.9
George Washington University                          1.7
Georgetown University                                 1.7
Georgia Institute of Technology                       2.0
Grinnell College                                      1.9
Harvard University                                   36.0
Indiana University                                    2.2
Johns Hopkins University                              3.8
Massachusetts Institute of Technology                15.0
Michigan State University                             2.7
New York University                                   4.0
Northwestern University                              10.4
Ohio State University                                 4.3
Pennsylvania State University                         4.0
Pomona College                                        2.2
Princeton University                                 23.8
Purdue University                                     2.4
Rice University                                       5.8
Rockefeller University                                2.0
Smith College                                         1.8
Stanford University                                  24.8
Swarthmore College                                    2.0
Texas A&M University                                 11.6
Tufts University                                      1.7
University of California                              9.8
University of California, Berkeley                    1.8
University of California, Los Angeles                 2.1
University of Chicago                                 7.5
University of Illinois                                2.6
University of Michigan                               10.9
University of Minnesota                               3.5
University of North Carolina at Chapel Hill           3.0
University of Notre Dame                              9.4
University of Oklahoma                                1.6
University of Pennsylvania                           12.2
University of Pittsburgh                              3.9
University of Richmond                                2.4
University of Rochester                               2.1
University of Southern California                     5.1
University of Texas                                  26.5
University of Virginia                                8.6
University of Washington                              2.5
University of Wisconsin-Madison                       2.7
Vanderbilt University                                 4.1
Virginia Commonwealth University                      1.8
Washington University in St. Louis                    7.9
Wellesley College                                     1.9
Williams College                                      2.5
Yale University                                      27.2


Summarize the data by constructing the following: 
a. A frequency distribution (classes 0-1.9, 2.0-3.9, 4.0-5.9, 6.0-7.9, and so on). 
b. A relative frequency distribution. 
c. A cumulative frequency distribution. 
d. A cumulative relative frequency distribution. 
e. What do these distributions tell you about the endowments of universities? 
f. Show a histogram. Comment on the shape of the distribution. 
g. What is the largest university endowment and which university holds it? 

22. Top U.S. Franchises. Entrepreneur magazine ranks franchises using performance 
measures such as growth rate, number of locations, startup costs, and financial 
stability. The number of locations for the top 20 U.S. franchises follows (The World 
Almanac). 

DATA file: Franchise

Franchise                          No. U.S. Locations
Hampton Inn                              1,864
ampm                                     3,183
McDonald's                              32,805
7-Eleven Inc.                           37,496
Supercuts                                2,130
Days Inn                                 1,877
Vanguard Cleaning Systems                2,155
Servpro                                  1,572
Subway                                  34,871
Denny's Inc.                             1,668
Jan-Pro Franchising Intl. Inc.          12,394
Hardee's                                 1,901
Pizza Hut Inc.                          13,281
Kumon Math & Reading Centers            25,199
Dunkin’ Donuts                           9,947
KFC Corp.                               16,224
Jazzercise Inc.                          7,683
Anytime Fitness                          1,618
Matco Tools                              1,431
Stratus Building Solutions               5,018


Use classes 0-4999, 5000-9999, 10,000-14,999, and so forth to answer the following 

questions. 

a. Construct a frequency distribution and a percent frequency distribution of the num- 
ber of U.S. locations for these top-ranked franchises. 

b. Construct a histogram of these data. 

c. Comment on the shape of the distribution. 

23. Percent Change in Stock Market Indexes. The following data show the year-to-date 
percent change (YTD % Change) for 30 stock-market indexes from around the world 
(The Wall Street Journal). 

a. What index has the largest positive YTD % Change? 

b. Using a class width of 5 beginning with -20 and going to 40, develop a frequency 
distribution for the data. 

c. Prepare a histogram. Interpret the histogram, including a discussion of the general 
shape of the histogram. 




DATA file: MarketIndexes

Country        Index                   YTD % Change
Australia      S&P/ASX200                  10.2
Belgium        Bel-20                      12.6
Brazil         São Paulo Bovespa          -14.4
Canada         S&P/TSX Comp                 2.6
Chile          Santiago IPSA              -16.3
China          Shanghai Composite          -9.3
Eurozone       EURO Stoxx                  10.0
France         CAC 40                      11.8
Germany        DAX                         10.6
Hong Kong      Hang Seng                   -3.5
India          S&P BSE Sensex              -4.7
Israel         Tel Aviv                     1.3
Italy          FTSE MIB                     6.6
Japan          Nikkei                      31.4
Mexico         IPC All-Share               -6.4
Netherlands    AEX                          9.3
Singapore      Straits Times               -2.5
South Korea    Kospi                       -6.4
Spain          IBEX 35                      6.4
Sweden         SX All Share                13.8
Switzerland    Swiss Market                17.4
Taiwan         Weighted                     2.3
U.K.           FTSE 100                    10.1
U.S.           S&P 500                     16.6
U.S.           DJIA                        14.5
U.S.           Dow Jones Utility            6.6
U.S.           Nasdaq 100                  17.4
U.S.           Nasdaq Composite            21.1
World          DJ Global ex U.S.            4.2
World          DJ Global Index              9.9


d. Use The Wall Street Journal or another media source to find the current percent changes 
for these stock market indexes in the current year. Which index has had the largest 
percent increase? Which index has had the smallest percent decrease? Prepare a 
summary of the data. 

24. Engineering School Graduate Salaries. The file EngineeringSalary contains the median 
starting salary and median mid-career salary (measured 10 years after graduation) for 
graduates from 19 engineering schools (The Wall Street Journal). Develop a stem-and-leaf 
display for both the median starting salary and the median mid-career salary. Comment on 
any differences you observe. 

DATA file: EngineeringSalary

25. Best Paying College Degrees. Each year America.edu ranks the best paying college 
degrees in America. The following data show the median starting salary, the mid-career 
salary, and the percentage increase from starting salary to mid-career salary for 
the 20 college degrees with the highest mid-career salary (America.edu website). 

DATA file: BestPayingDegrees

Degree                      Starting Salary   Mid-Career Salary   % Increase
Aerospace engineering            59,400            108,000             82
Applied mathematics              56,400            101,000             79
Biomedical engineering           54,800            101,000             84
Chemical engineering             64,800            108,000             67
Civil engineering                53,500             93,400             75
Computer engineering             61,200             87,700             43
Computer science                 56,200             97,700             74
Construction management          50,400             87,000             73
Economics                        48,800             97,800            100
Electrical engineering           60,800            104,000             71
Finance                          47,500             91,500             93
Government                       41,500             88,300            113
Information systems              49,300             87,100             77
Management info. systems         50,900             90,300             77
Mathematics                      46,400             88,300             90
Nuclear engineering              63,900            104,000             63
Petroleum engineering            93,000            157,000             69
Physics                          50,700             99,600             96
Software engineering             56,700             91,300             61
Statistics                       50,000             93,400             87

a. Using a class width of 10, construct a histogram for the percentage increase in the 
starting salary. 
b. Comment on the shape of the distribution. 
c. Develop a stem-and-leaf display for the percentage increase in the starting salary. 
d. What are the primary advantages of the stem-and-leaf display as compared to the 
histogram? 
26. Marathon Runner Ages. The Flying Pig is a marathon (26.2 miles long) running race 
held every year in Cincinnati, Ohio. Suppose that the following data show the ages for 
a sample of 40 half-marathoners. 

DATA file: Marathon

49  33  40  37  56
44  46  57  55  32
50  52  43  64  40
46  24  30  37  43
31  43  50  36  61
27  44  35  31  43
52  43  66  31  50
72  26  59  21  47


a. Construct a stretched stem-and-leaf display. 
b. What age group had the largest number of runners? 
c. What age occurred most frequently? 


2.3 Summarizing Data for Two Variables Using Tables 


Thus far in this chapter, we have focused on using tabular and graphical displays to 
summarize the data for a single categorical or quantitative variable. Often a manager 
or decision maker needs to summarize the data for two variables in order to reveal the 
relationship—if any—between the variables. In this section, we show how to construct a 
tabular summary of the data for two variables. 


Crosstabulation 


A crosstabulation is a tabular summary of data for two variables. Although both vari- 
ables can be either categorical or quantitative, crosstabulations in which one variable is 
categorical and the other variable is quantitative are just as common. We will illustrate this 
latter case by considering the following application based on data from Zagat’s Restaurant 
Review. Data showing the quality rating and the typical meal price were collected for a 
sample of 300 restaurants in the Los Angeles area. Table 2.9 shows the data for the first 
10 restaurants. Quality rating is a categorical variable with rating categories of Good, Very 
Good, and Excellent. Meal Price is a quantitative variable that ranges from $10 to $49. 

DATA file: Restaurant

TABLE 2.9 Quality Rating and Meal Price Data for 300 Los Angeles Restaurants

Restaurant    Quality Rating    Meal Price ($)
     1        Good                   18
     2        Very Good              22
     3        Good                   28
     4        Excellent              38
     5        Very Good              33
     6        Good                   28
     7        Very Good              19
     8        Very Good              11
     9        Very Good              29
    10        Good                   18

Grouping the data for a quantitative variable enables us to treat the quantitative 
variable as if it were a categorical variable when creating a crosstabulation. 

A crosstabulation of the data for this application is shown in Table 2.10. The labels shown 
in the margins of the table define the categories (classes) for the two variables. In the left 
margin, the row labels (Good, Very Good, and Excellent) correspond to the three rating cat- 
egories for the quality rating variable. In the top margin, the column labels ($10-19, $20-29, 
$30-39, and $40-49) show that the Meal Price data have been grouped into four classes. 
Because each restaurant in the sample provides a quality rating and a meal price, each res- 
taurant is associated with a cell appearing in one of the rows and one of the columns of the 
crosstabulation. For example, Table 2.9 shows restaurant 5 as having a Very Good quality 
rating and a Meal Price of $33. This restaurant belongs to the cell in row 2 and column 3 of 
the crosstabulation shown in Table 2.10. In constructing a crosstabulation, we simply count 
the number of restaurants that belong to each of the cells. 
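
The counting step can be sketched in Python using only the 10 restaurants listed in Table 2.9. This is an illustrative sketch, not the method used to build Table 2.10; the helper `price_class` and the variable names are hypothetical, and the full crosstabulation would require all 300 observations in the Restaurant file.

```python
from collections import Counter

# First 10 restaurants from Table 2.9: (quality rating, meal price in dollars).
restaurants = [
    ("Good", 18), ("Very Good", 22), ("Good", 28), ("Excellent", 38),
    ("Very Good", 33), ("Good", 28), ("Very Good", 19), ("Very Good", 11),
    ("Very Good", 29), ("Good", 18),
]

ratings = ["Good", "Very Good", "Excellent"]
price_classes = ["$10-19", "$20-29", "$30-39", "$40-49"]


def price_class(price):
    """Return the label of the $10-wide class that contains the price."""
    lower = (price // 10) * 10
    return f"${lower}-{lower + 9}"


# Each restaurant falls in exactly one (row, column) cell; count the cells.
cells = Counter((rating, price_class(price)) for rating, price in restaurants)

header = f"{'Quality Rating':<15}" + "".join(f"{c:>8}" for c in price_classes)
print(header + f"{'Total':>8}")
for rating in ratings:
    counts = [cells[(rating, c)] for c in price_classes]
    print(f"{rating:<15}" + "".join(f"{n:>8}" for n in counts) + f"{sum(counts):>8}")
```

For example, restaurant 5 (Very Good, $33) is counted in the cell for row Very Good and column $30-39, just as described above.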

Although four classes of the Meal Price variable were used to construct the crosstabulation 
shown in Table 2.10, the crosstabulation of quality rating and meal price could have been de- 
veloped using fewer or more classes for the meal price variable. The issues involved in deciding 
how to group the data for a quantitative variable in a crosstabulation are similar to the issues 
involved in deciding the number of classes to use when constructing a frequency distribution for 
a quantitative variable. For this application, four classes of meal price was considered a reason- 
able number of classes to reveal any relationship between quality rating and meal price. 


TABLE 2.10 Crosstabulation of Quality Rating and Meal Price Data for 300 Los Angeles 
Restaurants 

                              Meal Price
Quality Rating    $10-19    $20-29    $30-39    $40-49    Total
Good                 42        40         2         0        84
Very Good            34        64        46         6       150
Excellent             2        14        28        22        66
Total                78       118        76        28       300


In reviewing Table 2.10, we see that the greatest number of restaurants in the 
sample (64) have a very good rating and a meal price in the $20-29 range. Only two 
restaurants have an excellent rating and a meal price in the $10-19 range. Similar 
interpretations of the other frequencies can be made. In addition, note that the right and 
bottom margins of the crosstabulation provide the frequency distributions for quality 
rating and meal price separately. From the frequency distribution in the right margin, 
we see that data on quality ratings show 84 restaurants with a good quality rating, 150 
restaurants with a very good quality rating, and 66 restaurants with an excellent quality 
rating. Similarly, the bottom margin shows the frequency distribution for the meal 
price variable. 

Dividing the totals in the right margin of the crosstabulation by the total for that column 
provides a relative and percent frequency distribution for the quality rating variable. 

Quality Rating    Relative Frequency    Percent Frequency
Good                     .28                    28
Very Good                .50                    50
Excellent                .22                    22
Total                   1.00                   100

From the percent frequency distribution we see that 28% of the restaurants were rated 
good, 50% were rated very good, and 22% were rated excellent. 

Dividing the totals in the bottom row of the crosstabulation by the total for that row 
provides a relative and percent frequency distribution for the meal price variable. 

Meal Price        Relative Frequency    Percent Frequency
$10-19                   .26                    26
$20-29                   .39                    39
$30-39                   .25                    25
$40-49                   .09                     9
Total                   1.00                   100

Note that the sum of the values shown in the relative frequency column does not add 
exactly to 1.00 and the sum of the values shown in the percent frequency distribution 
does not add exactly to 100; the reason is that the values shown are rounded. 


From the percent frequency distribution we see that 26% of the meal prices are in the low- 
est price class ($10-19), 39% are in the next higher class, and so on. 
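
The marginal distributions can be reproduced from the cell counts of Table 2.10 with a short Python sketch. The variable names are hypothetical; the counts are those shown in the table.

```python
# Cell frequencies from Table 2.10, one list per quality rating.
price_classes = ["$10-19", "$20-29", "$30-39", "$40-49"]
table = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}

# Right margin: one total per row. Bottom margin: one total per column.
row_totals = {rating: sum(counts) for rating, counts in table.items()}
column_totals = [sum(column) for column in zip(*table.values())]
n = sum(row_totals.values())

# Divide each marginal total by n and round to two decimals.
rating_relative = {rating: round(t / n, 2) for rating, t in row_totals.items()}
price_relative = {label: round(t / n, 2) for label, t in zip(price_classes, column_totals)}

print(rating_relative)
print(price_relative)
print(round(sum(price_relative.values()), 2))  # rounded values need not sum to 1.00
```

The last line shows the rounding effect directly: the four rounded meal price relative frequencies sum to .99, not 1.00.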

The frequency and relative frequency distributions constructed from the margins of 
a crosstabulation provide information about each of the variables individually, but they 
do not shed any light on the relationship between the variables. The primary value of a 
crosstabulation lies in the insight it offers about the relationship between the variables. A 
review of the crosstabulation in Table 2.10 reveals that restaurants with higher meal prices 
received higher quality ratings than restaurants with lower meal prices. 

Converting the entries in a crosstabulation into row percentages or column percent- 
ages can provide more insight into the relationship between the two variables. For row 
percentages, the results of dividing each frequency in Table 2.10 by its corresponding 
row total are shown in Table 2.11. Each row of Table 2.11 is a percent frequency distri- 
bution of meal price for one of the quality rating categories. Of the restaurants with the 
lowest quality rating (good), we see that the greatest percentages are for the less expens- 
ive restaurants (50% have $10-19 meal prices and 47.6% have $20-29 meal prices). 

Of the restaurants with the highest quality rating (excellent), we see that the greatest 
percentages are for the more expensive restaurants (42.4% have $30-39 meal prices and 
33.4% have $40-49 meal prices). Thus, we continue to see that restaurants with higher 
meal prices received higher quality ratings. 
TABLE 2.11 Row Percentages for Each Quality Rating Category 

                              Meal Price
Quality Rating    $10-19    $20-29    $30-39    $40-49    Total
Good               50.0      47.6       2.4       0.0      100
Very Good          22.7      42.7      30.6       4.0      100
Excellent           3.0      21.2      42.4      33.4      100
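
Row percentages can be computed from Table 2.10 with the following Python sketch (variable names are hypothetical). Straightforward rounding to one decimal gives 30.7 for Very Good at $30-39 and 33.3 for Excellent at $40-49, whereas Table 2.11 shows 30.6 and 33.4; the table values appear to have been adjusted so that each row sums to exactly 100, although the text does not say so.

```python
# Cell frequencies from Table 2.10, one list per quality rating.
table = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}

# Divide each cell by its row total and express the result as a percentage.
row_percentages = {
    rating: [round(100 * count / sum(counts), 1) for count in counts]
    for rating, counts in table.items()
}

for rating, percentages in row_percentages.items():
    print(f"{rating:<10}", percentages)
```

Column percentages would follow the same pattern, dividing each cell by its column total instead.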


Crosstabulations are widely used to investigate the relationship between two variables. 
In practice, the final reports for many statistical studies include a large number of cross- 
tabulations. In the Los Angeles restaurant survey, the crosstabulation is based on one 
categorical variable (Quality Rating) and one quantitative variable (Meal Price). Crosstabu- 
lations can also be developed when both variables are categorical and when both variables 
are quantitative. When quantitative variables are used, however, we must first create classes 
for the values of the variable. For instance, in the restaurant example we grouped the meal 
prices into four classes ($10-19, $20-29, $30-39, and $40-49). 


Using Excel's PivotTable Tool to Construct a Crosstabulation 


Excel’s PivotTable tool can be used to summarize the data for two or more variables simul- 
taneously. We will illustrate the use of Excel’s PivotTable tool by showing how to develop 
a crosstabulation of quality ratings and meal prices for the sample of 300 restaurants 
located in the Los Angeles area. 


DATA file: Restaurant

Enter/Access Data: Open the file Restaurant. The data are in cells B2:C301 and labels are 
in column A and cells B1:C1. 


Apply Tools: Each of the three columns in the Restaurant data set [labeled Restaurant, Quality 
Rating, and Meal Price ($)] is considered a field by Excel. Fields may be chosen to represent 
rows, columns, or values in the PivotTable. The following steps describe how to use Excel’s 
PivotTable tool to construct a crosstabulation of quality ratings and meal prices. 


Step 1. Select cell A1 or any cell in the data set 
Step 2. Click the Insert tab on the Ribbon 
Step 3. In the Tables group click PivotTable 
Step 4. When the Create PivotTable dialog box appears: 
           Click OK; a PivotTable and PivotTable Fields task pane will appear in a new 
           worksheet 
Step 5. In the PivotTable Fields task pane: 
           Drag Quality Rating to the ROWS area 
           Drag Meal Price to the COLUMNS area 
           Drag Restaurant to the VALUES area 
Step 6. Click on Sum of Restaurant in the VALUES area 
Step 7. Select Value Field Settings... from the list of options that appears 
Step 8. When the Value Field Settings dialog box appears: 
           Under Summarize value field by, choose Count 
           Click OK 


Figure 2.16 shows the PivotTable Fields task pane and the corresponding PivotTable cre- 
ated following the above steps. For readability, columns H:AC have been hidden. 


Editing Options: To complete the PivotTable we need to group the columns containing the 
meal prices and place the rows for quality rating in the proper order. The following steps 
accomplish this. 
FIGURE 2.16 Initial PivotTable Fields Task Pane and PivotTable for the Restaurant Data 

[Figure: Excel worksheet showing the ungrouped PivotTable, with one column for each meal 
price from 10 to 48 and a Grand Total column, beside the PivotTable Fields task pane with 
Meal Price ($) in COLUMNS, Quality Rating in ROWS, and Count of Restaurant in VALUES.] 


Step 1. Right-click cell B4 in the PivotTable or any other cell containing meal prices 
Step 2. Choose Group from the list of options that appears 
Step 3. When the Grouping dialog box appears: 
           Enter 10 in the Starting at: box 
           Enter 49 in the Ending at: box 
           Enter 10 in the By: box 
           Click OK 
Step 4. Right-click on Excellent in cell A5 
Step 5. Select Move and click Move “Excellent” to End 
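
The effect of these two editing steps can be mimicked in Python. The sketch below is not Excel's implementation; `group_label` is a hypothetical helper whose `start`, `end`, and `step` parameters stand in for the Starting at, Ending at, and By boxes, and as a simplification it rejects values outside the stated range.

```python
def group_label(value, start=10, end=49, step=10):
    """Return the label of the group containing value, for groups of width step."""
    if not start <= value <= end:
        raise ValueError(f"{value} is outside the range {start}-{end}")
    lower = start + ((value - start) // step) * step
    upper = min(lower + step - 1, end)
    return f"{lower}-{upper}"


# Meal prices of the first 10 restaurants in Table 2.9, mapped to their groups.
prices = [18, 22, 28, 38, 33, 28, 19, 11, 29, 18]
labels = [group_label(price) for price in prices]
print(labels)

# Row labels start in alphabetical order; moving "Excellent" to the end
# restores the natural order of the rating categories.
row_order = sorted(["Good", "Very Good", "Excellent"])
row_order.append(row_order.pop(row_order.index("Excellent")))
print(row_order)
```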


The final PivotTable is shown in Figure 2.17. Note that it provides the same information as 
the crosstabulation shown in Table 2.10. 

FIGURE 2.17 Final PivotTable for the Restaurant Data 

[Figure: Excel worksheet showing the grouped PivotTable beside the PivotTable Fields task 
pane with Meal Price ($) in COLUMNS, Quality Rating in ROWS, and Count of Restaurant in 
VALUES.] 


Simpson's Paradox 


The data in two or more crosstabulations are often combined or aggregated to produce a 
summary crosstabulation showing how two variables are related. In such cases, conclusions 
drawn from two or more separate crosstabulations can be reversed when the data are 
aggregated into a single crosstabulation. The reversal of conclusions based on aggregate and 
unaggregated data is called Simpson’s paradox. To provide an illustration of Simpson’s 
paradox we consider an example involving the analysis of verdicts for two judges in two 
different courts. 

Judges Ron Luckett and Dennis Kendall presided over cases in Common Pleas Court
and Municipal Court during the past three years. Some of the verdicts they rendered
were appealed. In most of these cases the appeals court upheld the original verdicts,
but in some cases those verdicts were reversed. For each judge a crosstabulation was
developed based upon two variables: Verdict (upheld or reversed) and Type of Court
(Common Pleas and Municipal). Suppose that the two crosstabulations were then combined
by aggregating the type of court data. The resulting aggregated crosstabulation
contains two variables: Verdict (upheld or reversed) and Judge (Luckett or Kendall).
This crosstabulation shows the number of appeals in which the verdict was upheld and
the number in which the verdict was reversed for both judges. The following crosstabulation
shows these results along with the column percentages in parentheses next to
each value.

                        Judge
Verdict       Luckett        Kendall        Total
Upheld        129 (86%)      110 (88%)       239
Reversed       21 (14%)       15 (12%)        36
Total (%)     150 (100%)     125 (100%)      275


A review of the column percentages shows that 86% of the verdicts were upheld for 
Judge Luckett, whereas 88% of the verdicts were upheld for Judge Kendall. From this 
aggregated crosstabulation, we conclude that Judge Kendall is doing the better job because 
a greater percentage of Judge Kendall’s verdicts are being upheld. 

The following unaggregated crosstabulations show the cases tried by Judge Luckett 
and Judge Kendall in each court; column percentages are shown in parentheses next to 
each value. 


                     Judge Luckett
              Common         Municipal
Verdict       Pleas          Court          Total
Upheld        29 (91%)       100 (85%)       129
Reversed       3 (9%)         18 (15%)        21
Total (%)     32 (100%)      118 (100%)      150

                     Judge Kendall
              Common         Municipal
Verdict       Pleas          Court          Total
Upheld         90 (90%)      20 (80%)        110
Reversed       10 (10%)       5 (20%)         15
Total (%)     100 (100%)     25 (100%)       125


From the crosstabulation and column percentages for Judge Luckett, we see that the
verdicts were upheld in 91% of the Common Pleas Court cases and in 85% of the
Municipal Court cases. From the crosstabulation and column percentages for Judge
Kendall, we see that the verdicts were upheld in 90% of the Common Pleas Court
cases and in 80% of the Municipal Court cases. Thus, when we unaggregate the data,
we see that Judge Luckett has a better record because a greater percentage of Judge
Luckett’s verdicts are being upheld in both courts. This result contradicts the conclusion
we reached with the aggregated data crosstabulation that showed Judge Kendall
had the better record. This reversal of conclusions based on aggregated and unaggregated
data illustrates Simpson’s paradox.
The original crosstabulation was obtained by aggregating the data in the separate
crosstabulations for the two courts. Note that for both judges the percentage of appeals that 
resulted in reversals was much higher in Municipal Court than in Common Pleas Court. 
Because Judge Luckett tried a much higher percentage of his cases in Municipal Court, the 
aggregated data favored Judge Kendall. When we look at the crosstabulations for the two 
courts separately, however, Judge Luckett shows the better record. Thus, for the original 
crosstabulation, we see that the type of court is a hidden variable that cannot be ignored 
when evaluating the records of the two judges. 
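The reversal can be checked with a few lines of arithmetic. The sketch below is illustrative only: it recomputes the upheld percentages from the counts in the crosstabulations above, and the variable and function names (`verdicts`, `percent_upheld`, `municipal_share`) are introduced here and are not part of the text.

```python
# (upheld, total appealed) by judge and court, from the crosstabulations above.
verdicts = {
    "Luckett": {"Common Pleas": (29, 32), "Municipal": (100, 118)},
    "Kendall": {"Common Pleas": (90, 100), "Municipal": (20, 25)},
}

def percent_upheld(upheld, total):
    """Percentage of appealed verdicts that were upheld."""
    return 100 * upheld / total

# Upheld percentage within each court (unaggregated view).
by_court = {
    judge: {court: percent_upheld(u, t) for court, (u, t) in courts.items()}
    for judge, courts in verdicts.items()
}

# Upheld percentage after combining both courts (aggregated view).
aggregated = {
    judge: percent_upheld(
        sum(u for u, _ in courts.values()),
        sum(t for _, t in courts.values()),
    )
    for judge, courts in verdicts.items()
}

# Share of each judge's appealed cases that came from Municipal Court.
municipal_share = {
    judge: 100 * courts["Municipal"][1] / sum(t for _, t in courts.values())
    for judge, courts in verdicts.items()
}

for judge in verdicts:
    print(judge,
          {court: round(p) for court, p in by_court[judge].items()},
          "aggregated:", round(aggregated[judge]),
          "municipal share:", round(municipal_share[judge]))
```

Judge Luckett has the higher upheld percentage in each court, yet the lower aggregated percentage, because roughly 79% of Luckett's appealed cases came from Municipal Court compared with 20% of Kendall's.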

Because of the possibility of Simpson’s paradox, keep in mind that the conclusion or
interpretation may be reversed depending upon whether you are viewing unaggregated
or aggregated crosstabulation data. Before drawing a conclusion, you may want to
investigate whether the aggregated or unaggregated form of the crosstabulation provides
the better insight and conclusion. Especially when the crosstabulation involves aggregated
data, you should investigate whether a hidden variable could affect the results
such that separate or unaggregated crosstabulations provide a different and possibly
better insight and conclusion.


Methods 
27. The following data are for 30 observations involving two categorical variables, x and y. 
The categories for x are A, B, and C; the categories for y are 1 and 2. 


Observation   x   y      Observation   x   y
     1        A   1          16        B   2
     2        B   1          17        C   1
     3        B   1          18        B   1
     4        C   2          19        C   1
     5        B   1          20        B   1
     6        C   2          21        C   2
     7        B   1          22        B   1
     8        C   2          23        C   2
     9        A   1          24        A   1
    10        B   1          25        B   1
    11        A   1          26        C   2
    12        B   1          27        C   2
    13        C   2          28        A   1
    14        C   2          29        B   1
    15        C   2          30        B   2

DATA file: Crosstab

a. Develop a crosstabulation for the data, with x as the row variable and y as the
column variable.

b. Compute the row percentages. 

c. Compute the column percentages. 

d. What is the relationship, if any, between x and y? 

28. The following observations are for two quantitative variables, x and y.

Observation    x    y      Observation    x    y
     1        28   72          11        13   98
     2        17   99          12        84   21
     3        52   58          13        59   32
     4        79   34          14        17   81
     5        37   60          15        70   34
     6        71   22          16        47   64
     7        37   77          17        35   68
     8        27   85          18        62   67
     9        64   45          19        30   [illegible]
    10        53   47          20        43   28

a. Develop a crosstabulation for the data, with x as the row variable and y as the
column variable. For x use classes of 10-29, 30-49, and so on; for y use classes of
40-59, 60-79, and so on.

b. Compute the row percentages.

c. Compute the column percentages.

d. What is the relationship, if any, between x and y?

Applications 

29. Average Speeds of Daytona 500 Winners by Automobile Makes. The Daytona
500 is a 500-mile automobile race held annually at the Daytona International Speedway
in Daytona Beach, Florida. The following crosstabulation shows the automobile
make by average speed of the 25 winners from 1988 to 2012 (The World Almanac).

                        Average Speed in Miles per Hour
Make         130-139.9   140-149.9   150-159.9   160-169.9   170-179.9   Total
Buick            1                                                          1
Chevrolet        3           5           4           3           1         16
Dodge                        2                                              2
Ford             2           1           2           1                      6
Total            6           8           6           4           1         25

a. Compute the row percentages.

b. What percentage of winners driving a Chevrolet won with an average speed of at
least 150 miles per hour?

c. Compute the column percentages.

d. What percentage of winning average speeds 160-169.9 miles per hour were Chevrolets?


30. Average Speeds of Daytona 500 Winners by Years. The following crosstabulation
shows the average speed of the 25 winners by year of the Daytona 500 automobile
race (The World Almanac).

                                      Year
Average Speed   1988-1992   1993-1997   1998-2002   2003-2007   2008-2012   Total
130-139.9           1                                   2           3          6
140-149.9           2           2           1           2           1          8
150-159.9                       3           1           1           1          6
160-169.9           2                       2                                  4
170-179.9                                   1                                  1
Total               5           5           5           5           5         25




a. Calculate the row percentages.
b. What is the apparent relationship between average winning speed and year? What
might be the cause of this apparent relationship?
31. Golf Course Complaints. Recently, management at Oak Tree Golf Course received
a few complaints about the condition of the greens. Several players complained that
the greens are too fast. Rather than react to the comments of just a few, the Golf
Association conducted a survey of 100 male and 100 female golfers. The survey
results are summarized here.

Male Golfers
               Greens Condition
Handicap       Too Fast    Fine
Under 15          10        40
15 or more        25        25

Female Golfers
               Greens Condition
Handicap       Too Fast    Fine
Under 15           1         9
15 or more        39        51


a. Combine these two crosstabulations into one with Male and Female as the row 
labels and Too Fast and Fine as the column labels. Which group shows the highest 
percentage saying that the greens are too fast? 

b. Refer to the initial crosstabulations. For those players with low handicaps (better 
players), which group (male or female) shows the highest percentage saying the 
greens are too fast? 

c. Refer to the initial crosstabulations. For those players with higher handicaps, which 
group (male or female) shows the highest percentage saying the greens are too fast? 

d. What conclusions can you draw about the preferences of men and women concerning
the speed of the greens? Are the conclusions you draw from part (a) as compared
with parts (b) and (c) consistent? Explain any apparent inconsistencies.

32. Household Income Levels. The following crosstabulation shows the number of
households (1000s) in each of the four regions of the United States and the number
of households at each income level (U.S. Census Bureau website).

                                     Income Level of Household
                        $15,000    $25,000    $35,000    $50,000    $75,000                Number of
             Under      to         to         to         to         to         $100,000    Households
Region       $15,000    $24,999    $34,999    $49,999    $74,999    $99,999    and Over    (1000s)
Northeast     2,733      2,244      2,264      2,807      3,699      2,486      5,246       21,479
Midwest       3,273      3,326      3,056      3,767      5,044      3,183      4,742       26,391
South         6,235      5,657      5,038      6,476      7,730      4,813      7,660       43,609
West          3,086      2,796      2,644      3,557      4,804      3,066      6,104       26,057
Total        15,327     14,023     13,002     16,607     21,277     13,548     23,752      117,536


a. Compute the row percentages and identify the percent frequency distributions of 
income for households in each region. 

b. What percentage of households in the West region have an income level of $50,000 
or more? What percentage of households in the South region have an income level 
of $50,000 or more? 

c. Construct percent frequency histograms for each region of households. Do any relationships
between regions and income level appear to be evident in your findings?

d. Compute the column percentages. What information do the column percentages 
provide? 

e. What percent of households with a household income of $100,000 and over are 
from the South region? What percentage of households from the South region have a 
household income of $100,000 and over? Why are these two percentages different? 
33. Value of World’s Most Valuable Brands. Each year Forbes ranks the world’s most
valuable brands. A portion of the data for 82 of the brands in the 2013 Forbes list is 
shown in Table 2.12 (Forbes website). The data set includes the following variables: 


Brand: The name of the brand. 


Industry: The type of industry associated with the brand, labeled Automotive 
& Luxury, Consumer Packaged Goods, Financial Services, Other, Technology. 


Brand Value ($ billions): A measure of the brand’s value in billions of dollars 
developed by Forbes based on a variety of financial information about the brand. 


1-Yr Value Change (%): The percentage change in the value of the brand over the 
previous year. 


Brand Revenue ($ billions): The total revenue in billions of dollars for the brand. 


a. Prepare a crosstabulation of the data on Industry (rows) and Brand Value ($ billions).
Use classes of 0-10, 10-20, 20-30, 30-40, 40-50, and 50-60 for Brand
Value ($ billions).

b. Prepare a frequency distribution for the data on Industry.

c. Prepare a frequency distribution for the data on Brand Value ($ billions).

d. How has the crosstabulation helped in preparing the frequency distributions in
parts (b) and (c)?
e. What conclusions can you draw about the type of industry and the brand value? 
34. Revenue of World’s Most Valuable Brands. Refer to Table 2.12. 


a. Prepare a crosstabulation of the data on Industry (rows) and Brand Revenue 
($ billions). Use class intervals of 25 starting at 0 for Brand Revenue ($ billions). 


b. Prepare a frequency distribution for the data on Brand Revenue ($ billions). 
c. What conclusions can you draw about the type of industry and the brand revenue? 


d. Prepare a crosstabulation of the data on Industry (rows) and the 1-Yr Value Change
(%). Use class intervals of 20 starting at -60 for 1-Yr Value Change (%).


e. Prepare a frequency distribution for the data on 1-Yr Value Change (%). 


f. What conclusions can you draw about the type of industry and the 1-year change in 
value? 


TABLE 2.12 Data for 82 of the Most Valuable Brands

                                            Brand Value     1-Yr Value    Brand Revenue
Brand          Industry                     ($ billions)    Change (%)    ($ billions)
Accenture      Other                             9.7            10            30.4
Adidas         Other                             8.4            23            14.5
Allianz        Financial Services                6.9             5           130.8
Amazon.Com     Technology                       14.7            44            60.6
...
Heinz          Consumer Packaged Goods           5.6             2             4.4
Hermes         Automotive & Luxury           [illegible]        20             4.5
...
Wells Fargo    Financial Services                9             -14            91.2
Zara           Other                             9.4            11            13.5

DATA file: BrandValue

Source: Data from Forbes, 2014.
TABLE 2.13 Fuel Efficiency Data for 341 Cars

                                                       Fuel           City           Hwy
Car    Size       Displacement   Cylinders   Drive     Type           MPG            MPG
  1    Compact        1.4            4         F       R              27             40
  2    Compact        1.4            4         F       R              27             35
  3    Compact        1.4            4         F       R              28             38
...
190    Compact        2.5            4         F       R              27             36
191    Large          2.5            4         F       R              22             30
192    Midsize        2.5            4         F       R              21             32
...
339    Large          6.0           12         R       [illegible]    18             21
340    Large          6.0           12         R       R              [illegible]    22
341    Large          6.0           12         R       R              18             20

DATA file: FuelData2018


35. Car Fuel Efficiencies. The U.S. Department of Energy’s Fuel Economy Guide 
provides fuel efficiency data for cars and trucks (Fuel Economy website). A portion of 
the data from 2018 for 341 compact, midsize, and large cars is shown in Table 2.13. 
The data set contains the following variables: 

Size: Compact, Midsize, and Large 

Displacement: Engine size in liters 

Cylinders: Number of cylinders in the engine 

Drive: All wheel (A), front wheel (F), and rear wheel (R) 

Fuel Type: Premium (P) or regular (R) fuel 

City MPG: Fuel efficiency rating for city driving in terms of miles per gallon 
Hwy MPG: Fuel efficiency rating for highway driving in terms of miles per gallon 


The complete data set is contained in the file FuelData2018. 

a. Prepare a crosstabulation of the data on Size (rows) and Hwy MPG (columns). 
Use classes of 20-24, 25-29, 30-34, 35-39, and 40-44 for Hwy MPG. 

b. Comment on the relationship between Size and Hwy MPG. 

c. Prepare a crosstabulation of the data on Drive (rows) and City MPG (columns). Use 
classes of 10-14, 15-19, 20-24, 25-29, and 30-34 for City MPG. 

d. Comment on the relationship between Drive and City MPG. 

e. Prepare a crosstabulation of the data on Fuel Type (rows) and City MPG (columns). 
Use classes of 10-14, 15-19, 20-24, 25-29, and 30-34 for City MPG. 

f. Comment on the relationship between Fuel Type and City MPG. 


2.4 Summarizing Data for Two Variables Using Graphical Displays

In the previous section we showed how a crosstabulation can be used to summarize the
data for two variables and help reveal the relationship between the variables. In most cases,
a graphical display is more useful for recognizing patterns and trends in the data.

In this section, we introduce a variety of graphical displays for exploring the relationships
between two variables. Displaying data in creative ways can lead to powerful insights and
allow us to make “common-sense inferences” based on our ability to visually compare, contrast,
and recognize patterns. We begin with a discussion of scatter diagrams and trendlines.


Scatter Diagram and Trendline 


A scatter diagram is a graphical display of the relationship between two quantitative variables,
and a trendline is a line that provides an approximation of the relationship. As an illustration,
consider the advertising/sales relationship for an electronics store in San Francisco.
On 10 occasions during the past three months, the store used weekend television commercials
to promote sales at its stores. The managers want to investigate whether a relationship exists
between the number of commercials shown and sales at the store during the following week.
Sample data for the 10 weeks with sales in hundreds of dollars are shown in Table 2.14.

Figure 2.18 shows the scatter diagram and the trendline¹ for the data in Table 2.14.
The number of commercials (x) is shown on the horizontal axis and the sales (y) are
shown on the vertical axis. For week 1, x = 2 and y = 50. A point with those coordinates
is plotted on the scatter diagram. Similar points are plotted for the other nine weeks.
Note that during two of the weeks one commercial was shown, during two of the weeks
two commercials were shown, and so on.


TABLE 2.14 Sample Data for the San Francisco Electronics Store

          Number of Commercials    Sales ($100s)
Week               x                     y
  1                2                    50
  2                5                    57
  3                1                    41
  4                3                    54
  5                4                    54
  6                1                    38
  7                5                    63
  8                3                    48
  9                4                    59
 10                2                    46

DATA file: Electronics


FIGURE 2.18 Scatter Diagram and Trendline for the San Francisco Electronics Store
(Horizontal axis: Number of Commercials; vertical axis: Sales ($100s))

¹The equation of the trendline is y = 36.15 + 4.95x. The slope of the trendline is 4.95 and the y-intercept (the point
where the trendline intersects the y-axis) is 36.15. We will discuss in detail the interpretation of the slope and y-intercept
for a linear trendline in Chapter 14 when we study simple linear regression.


FIGURE 2.19 Types of Relationships Depicted by Scatter Diagrams
(Panels: Positive Relationship, No Apparent Relationship, Negative Relationship)

The scatter diagram in Figure 2.18 indicates a positive relationship between the
number of commercials and sales. Higher sales are associated with a higher number of
commercials. The relationship is not perfect in that not all points are on a straight line.
However, the general pattern of the points and the trendline suggest that the overall
relationship is positive.

Some general scatter diagram patterns and the types of relationships they suggest are 
shown in Figure 2.19. The top left panel depicts a positive relationship similar to the 
one for the number of commercials and sales example. In the top right panel, the scatter 
diagram shows no apparent relationship between the variables. The bottom panel depicts a 
negative relationship where y tends to decrease as x increases. 
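The trendline equation quoted for Figure 2.18, y = 36.15 + 4.95x, can be reproduced from the Table 2.14 data. The sketch below is a minimal illustration that fits a straight line by least squares, which is how Excel fits a linear trendline; the function name `least_squares_line` is introduced here for illustration only.

```python
# Table 2.14: number of commercials (x) and sales in $100s (y) for 10 weeks.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = least_squares_line(commercials, sales)
print(f"y = {intercept:.2f} + {slope:.2f}x")

# A positive slope corresponds to the positive relationship seen in the scatter diagram.
print("positive relationship" if slope > 0 else "no positive relationship")
```

The positive slope agrees with the upward pattern of the points: on average, each additional commercial is associated with about 4.95 hundred dollars of additional weekly sales in this sample.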


Using Excel to Construct a Scatter Diagram and a Trendline 


We can use Excel to construct a scatter diagram and a trendline for the San Francisco electronics
store data.

Enter/Access Data: Open the file Electronics. The data are in cells B2:C11 and labels are
in column A and cells B1:C1.
Apply Tools: The following steps describe how to use Excel to construct a scatter diagram
from the data in the worksheet.

Step 1. Select cells B1:C11
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click Insert Scatter (X,Y) or Bubble Chart
Step 4. When the list of scatter diagram subtypes appears:
            Click Scatter (the chart in the upper left corner)


The worksheet in Figure 2.20 shows the scatter diagram produced using these steps. 


Editing Options: You can easily edit the scatter diagram to display a different chart
title, add axis titles, and display the trendline. For instance, suppose you would like to
use “Scatter Chart for the San Francisco Electronics Store” as the chart title and insert
“No. of Commercials” for the horizontal axis title and “Sales ($100s)” for the vertical
axis title.

Step 1. Click the Chart Title and replace it with Scatter Chart for the San Francisco
        Electronics Store
Step 2. Click the Chart Elements button (located next to the top right corner of the
        chart)
Step 3. When the list of chart elements appears:
            Select the check box for Axis Titles (creates placeholders for the axis
            titles)
            Deselect the check box for Gridlines (to remove the gridlines from the chart)
            Select the check box for Trendline
Step 4. Click the horizontal Axis Title placeholder and replace it with No. of
        Commercials
Step 5. Click the vertical Axis Title placeholder and replace it with Sales ($100s)
Step 6. To change the trendline from a dashed line to a solid line, right-click on the
        trendline and choose Format Trendline


FIGURE 2.20 Initial Scatter Diagram for the San Francisco Electronics Store Using
Excel's Recommended Charts Tool
(Worksheet with Week, No. of Commercials, and Sales Volume in columns A:C and the initial scatter chart titled Sales Volume)
FIGURE 2.21 Edited Scatter Diagram and Trendline for the San Francisco Electronics
Store Using Excel's Recommended Charts Tool
(Chart title: Scatter Chart for the San Francisco Electronics Store; horizontal axis: No. of Commercials; vertical axis: Sales ($100s))


Step 7. In the Format Trendline task pane:
            Select the Fill & Line option
            In the Dash type box, select Solid
            Close the Format Trendline task pane


The edited scatter diagram and trendline are shown in Figure 2.21. 


Side-by-Side and Stacked Bar Charts 


In Section 2.1 we said that a bar chart is a graphical display for depicting categorical data 
summarized in a frequency, relative frequency, or percent frequency distribution. Side-by-side
bar charts and stacked bar charts are extensions of basic bar charts that are used to
display and compare two variables. By displaying two variables on the same chart, we may 
better understand the relationship between the variables. 

A side-by-side bar chart is a graphical display for depicting multiple bar charts on the 
same display. To illustrate the construction of a side-by-side chart, recall the application 
involving the quality rating and meal price data for a sample of 300 restaurants located 
in the Los Angeles area. Quality rating is a categorical variable with rating categories of 
Good, Very Good, and Excellent. Meal Price is a quantitative variable that ranges from $10 
to $49. The crosstabulation displayed in Table 2.10 shows that the data for meal price were 
grouped into four classes: $10-19, $20-29, $30-39, and $40-49. We will use these classes 
to construct a side-by-side bar chart. 

Figure 2.22 shows a side-by-side chart for the restaurant data. The color of each bar 
indicates the quality rating (light blue = good, medium blue = very good, and dark blue = 
excellent). Each bar is constructed by extending the bar to the point on the vertical axis that 
represents the frequency with which that quality rating occurred for each of the meal price 
categories. Placing each meal price category’s quality rating frequency adjacent to one 
another allows us to quickly determine how a particular meal price category is rated. We 
see that the lowest meal price category ($10-19) received mostly good and very good ratings,
but very few excellent ratings. The highest price category ($40-49), however, shows
a much different result. This meal price category received mostly excellent ratings, some 
very good ratings, but no good ratings. 
FIGURE 2.22 Side-by-Side Bar Chart for the Quality and Meal Price Data
(Horizontal axis: Meal Price ($) in classes 10-19, 20-29, 30-39, 40-49; vertical axis: Frequency; legend: Good, Very Good, Excellent)


Figure 2.22 also provides a good sense of the relationship between meal price and 
quality rating. Notice that as the price increases (left to right), the height of the light blue 
bars decreases and the height of the dark blue bars generally increases. This indicates that 
as price increases, the quality rating tends to be better. The very good rating, as expected, 
tends to be more prominent in the middle price categories as indicated by the dominance of 
the middle bar in the moderate price ranges of the chart. 

Stacked bar charts are another way to display and compare two variables on the same
display. A stacked bar chart is a bar chart in which each bar is broken into rectangular
segments of a different color showing the relative frequency of each class in a manner similar
to a pie chart. To illustrate a stacked bar chart we will use the quality rating and meal
price data summarized in the crosstabulation shown in Table 2.10.

We can convert the frequency data in Table 2.10 into column percentages by dividing
each element in a particular column by the total for that column. For instance, 42 of the
78 restaurants with a meal price in the $10-19 range had a good quality rating. In other
words, (42/78)(100) or 53.8% of the 78 restaurants had a good rating. Table 2.15 shows the
column percentages for each meal price category. Using the data in Table 2.15 we constructed
the stacked bar chart shown in Figure 2.23. Because the stacked bar chart is based
on percentages, Figure 2.23 shows even more clearly than Figure 2.22 the relationship
between the variables. As we move from the low price category ($10-19) to the high price
category ($40-49), the length of the light blue bars decreases and the length of the dark
blue bars increases.


TABLE 2.15 Column Percentages for Each Meal Price Category

                               Meal Price
Quality Rating    $10-19     $20-29     $30-39     $40-49
Good               53.8%      33.9%       2.6%       0.0%
Very Good          43.6       54.2       60.5       21.4
Excellent           2.6       11.9       36.8       78.6
Total             100.0%     100.0%     100.0%     100.0%
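The column percentages can be recomputed directly from the restaurant frequencies. The following sketch is illustrative only: it divides each frequency by its column (meal price) total, using the crosstabulation counts for the 300 restaurants; the dictionary layout and variable names are choices made here, not part of the text.

```python
# Frequencies by quality rating (rows) and meal price class (columns).
freq = {
    "Good":      {"$10-19": 42, "$20-29": 40, "$30-39": 2,  "$40-49": 0},
    "Very Good": {"$10-19": 34, "$20-29": 64, "$30-39": 46, "$40-49": 6},
    "Excellent": {"$10-19": 2,  "$20-29": 14, "$30-39": 28, "$40-49": 22},
}
price_classes = ["$10-19", "$20-29", "$30-39", "$40-49"]

# Column totals: number of restaurants in each meal price class.
col_total = {c: sum(row[c] for row in freq.values()) for c in price_classes}

# Column percentage = (cell frequency / column total) * 100.
col_pct = {
    rating: {c: 100 * row[c] / col_total[c] for c in price_classes}
    for rating, row in freq.items()
}

print(f"{'Quality Rating':<15}" + "".join(f"{c:>9}" for c in price_classes))
for rating, row in col_pct.items():
    print(f"{rating:<15}" + "".join(f"{row[c]:>9.1f}" for c in price_classes))
```

Each column of percentages sums to 100, which is what allows the bars of a stacked percent bar chart to have equal length.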
FIGURE 2.23 Stacked Bar Chart for the Quality Rating and Meal Price Data


Using Excel's Recommended Charts Tool to Construct
Side-by-Side and Stacked Bar Charts


In Figure 2.17 we showed the results of using Excel’s PivotTable tool to construct
a crosstabulation for the sample of 300 restaurants in the Los Angeles area.
We will use these results to illustrate how Excel’s Recommended Charts tool can be
used to construct side-by-side and stacked bar charts for the restaurant data using the
PivotTable output.


Apply Tools: The following steps describe how to use Excel’s Recommended Charts tool 
to construct a side-by-side bar chart for the restaurant data using the PivotTable tool output 
shown in Figure 2.17. 


Step 1. Select any cell in the PivotTable report (cells A3:F8) 

Step 2. Click the Insert tab on the Ribbon. 

Step 3. In the Charts group click Recommended Charts; a preview showing a bar chart 
with quality rating on the horizontal axis appears 

Step 4. Click OK (alternatively, you can preview a different chart type by selecting one 
of the other chart types listed on the left side of the Insert Chart dialog box) 

Step 5. Click the Design tab on the Ribbon (located under the heading PivotChart 
Tools) 

Step 6. In the Data group click Switch Row/Column; a side-by-side bar chart with meal 
price on the horizontal axis appears 


Editing Options: We can easily edit the side-by-side bar chart to change the bar colors and
axis labels.

Step 1. Click the Chart Elements button (located next to the top right corner of the
        chart)
Step 2. When the list of chart elements appears:
            Select the check box for Axis Titles (creates placeholders for the axis titles)
Step 3. Click the horizontal Axis Title placeholder and replace it with Meal Price ($)
Step 4. Click the vertical Axis Title placeholder and replace it with Frequency
Step 5. Right-click each bar in the chart and change the Fill to the desired color to match
        Figure 2.24

Note: Excel refers to the bar chart in Figure 2.24 as a Clustered Column chart.


The edited side-by-side chart is shown in Figure 2.24. 
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Edited Side-by-Side Chart for the Restaurant Data Constructed Using Excel's 


Recommended Charts Tool 


A B c D E 

Count of Restaurant Column Labels ~ 

Row Labels ~ Good Very Good Excellent Grand Total 
10-19 42 34 2 78 
20-29 40 64 14 118 
30-39 2 46 28 76 
40-49 6 22 28 
Grand Total 84 150 66 300 
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Meal Price ($) ~ 


You can easily change the side-by-side bar chart to a stacked bar chart using the following
steps.

Step 6. Click the Design tab on the PivotChart Tools Ribbon
            In the Type group, click Change Chart Type
Step 7. When the Change Chart Type dialog box appears:
            Select the Stacked Columns option
            Click OK


NOTES + COMMENTS

1. A time series is a sequence of observations on a variable
measured at successive points in time or over successive
periods of time. A scatter diagram in which the value of
time is shown on the horizontal axis and the time series
values are shown on the vertical axis is referred to in
time series analysis as a time series plot. We will discuss
time series plots and how to analyze time series data in
Chapter 17.

2. A stacked bar chart can also be used to display frequencies
rather than percentage frequencies. In this case, the different
color segments of each bar represent the contribution to the
total for that bar, rather than the percentage contribution.


EXERCISES


Methods


36. The following 20 observations are for two quantitative variables, x and y. 


Observation    x              y              Observation    x              y
     1         -22            22                 11         [illegible]    48
     2         [illegible]    49                 12         34             [illegible]
     3         2              8                  13         9              [illegible]
     4         [illegible]    -16                14         [illegible]    [illegible]
     5         [illegible]    10                 15         20             -16
     6         21             [illegible]        16         [illegible]    14
     7         [illegible]    27                 17         5              18
     8         [illegible]    35                 18         12             17
     9         14             -5                 19         [illegible]    [illegible]
    10         3              [illegible]        20         [illegible]    -22

DATA file: Scatter

a. Develop a scatter diagram for the relationship between x and y.
b. What is the relationship, if any, between x and y? 

37. Consider the following data on two categorical variables. The first variable, x, can take
on values A, B, C, or D. The second variable, y, can take on values I or II. The following
table gives the frequency with which each combination occurs.

           y
x       I      II
A      143    857
B      200    800
C      321    679
D      420    580


a. Construct a side-by-side bar chart with x on the horizontal axis. 
b. Comment on the relationship between x and y. 

38. The following crosstabulation summarizes the data for two categorical variables, x and 
y. The variable x can take on values Low, Medium, or High and the variable y can take 
on values Yes or No. 


y 
x Yes No Total 
Low 20 10 30 
Medium 15 35 50 
High 20 5 25 
Total 55 50 105 


a. Compute the row percentages. 
b. Construct a stacked percent frequency bar chart with x on the horizontal axis. 


Applications 
39. Driving Speed and Fuel Efficiency. A study on driving speed (miles per hour) and
fuel efficiency (miles per gallon) for midsize automobiles resulted in the following
data:

Driving Speed      30   50   40   55   30   25   60   25   50   55
Fuel Efficiency    28   25   25   23   30   32   21   35   26   25

a. Construct a scatter diagram with driving speed on the horizontal axis and fuel efficiency
on the vertical axis.
b. Comment on any apparent relationship between these two variables.
40. Low Temperatures and Snowfall. The file Snow contains average annual snowfall
(inches) for 51 major U.S. cities over 30 years. For example, the average low temperature
for Columbus, Ohio, is 44 degrees and the average annual snowfall is 27.5 inches.

DATA file: Snow

a. Construct a scatter diagram with the average annual low temperature on the horizontal
axis and the average annual snowfall on the vertical axis.

b. Does there appear to be any relationship between these two variables? 

c. Based on the scatter diagram, comment on any data points that seem unusual. 

41. Hypertension and Heart Disease. People often wait until middle age to worry about 
having a healthy heart. However, recent studies have shown that earlier monitoring of 
risk factors such as blood pressure can be very beneficial (The Wall Street Journal). Hav- 
ing higher than normal blood pressure, a condition known as hypertension, is a major 
risk factor for heart disease. Suppose a large sample of male and female individuals of 
various ages was selected and that each individual’s blood pressure was measured to 
determine if they have hypertension. For the sample data, the following table shows the 
percentage of individuals with hypertension. 


DATA file: Hypertension

Age       Male      Female
20-34    11.00%      9.00%
35-44    24.00%     19.00%
45-54    39.00%     37.00%
55-64    57.00%     56.00%
65-74    62.00%     64.00%
75+      73.30%     79.00%


a. Develop a side-by-side bar chart with age on the horizontal axis, the percentage of 
individuals with hypertension on the vertical axis, and side-by-side bars based on gender. 
b. What does the display you developed in part (a) indicate about hypertension and age? 
c. Comment on differences by gender. 

42. 4K TV. 4K televisions (TVs) contain approximately four times as many pixels to 
display images as a 1080-display TV, and so they offer much higher picture res- 
olution. Suppose that the following survey results show 4K TV ownership compared 
to ownership of 1080 TVs and other types of TVs by age. 


DATA file: Televisions

Age Category    4K TV (%)    1080 TV (%)    Other (%)
18-24              49            46             5
25-34              58            35             7
35-44              44            45            11
45-54              28            58            14
55-64              22            59            19
65+                11            45            44


a. Construct a stacked bar chart to display the above survey data on type of television 
ownership. Use age category as the variable on the horizontal axis. 
b. Comment on the relationship between age and television ownership. 
c. How would you expect the results of this survey to be different if conducted 
10 years from now? 
43. Store Managers Time Study. The Northwest regional manager of an outdoor equip- 
ment retailer conducted a study to determine how managers at three store locations are 
using their time. A summary of the results is shown in the following table. 


DATA file: ManagerTime

                  Percentage of Manager's Work Week Spent On
Store Location    Meetings    Reports    Customers    Idle
Bend                 18          11          52         19
Portland             52          11          24         13
Seattle              32          17          37         14


a. Create a stacked bar chart with store location on the horizontal axis and percentage 
of time spent on each task on the vertical axis. 

b. Create a side-by-side bar chart with store location on the horizontal axis and side- 
by-side bars of the percentage of time spent on each task. 

c. Which type of bar chart (stacked or side-by-side) do you prefer for these data? Why? 
2.5 Data Visualization: Best Practices in Creating 
Effective Graphical Displays 


Data visualization is a term used to describe the use of graphical displays to summarize 
and present information about a data set. The goal of data visualization is to communicate 
as effectively and clearly as possible the key information about the data. In this section, 
we provide guidelines for creating an effective graphical display, discuss how to select 
an appropriate type of display given the purpose of the study, illustrate the use of data 
dashboards, and show how the Cincinnati Zoo and Botanical Garden uses data visualiza- 
tion techniques to improve decision making. 


Creating Effective Graphical Displays 


The data presented in Table 2.16 show the forecasted or planned value of sales ($1000s) 
and the actual value of sales ($1000s) by sales region in the United States for Gustin 
Chemical for the past year. Note that there are two quantitative variables (planned sales and 
actual sales) and one categorical variable (sales region). Suppose we would like to develop 
a graphical display that would enable management of Gustin Chemical to visualize how 
each sales region did relative to planned sales and simultaneously enable management to 
visualize sales performance across regions. 

Figure 2.25 shows a side-by-side bar chart of the planned versus actual sales data. Note 
how this bar chart makes it very easy to compare the planned versus actual sales in a region, 
as well as across regions. This graphical display is simple, contains a title, is well labeled, 
and uses distinct colors to represent the two types of sales. Note also that the scale of the 
vertical axis begins at zero. The four sales regions are separated by space so that it is clear 
that they are distinct, whereas the planned versus actual sales values are side by side for 
easy comparison within each region. The side-by-side bar chart in Figure 2.25 makes it easy 
to see that the Southwest region is the lowest in both planned and actual sales and that the 
Northwest region slightly exceeded its planned sales. 


TABLE 2.16 Planned and Actual Sales by Sales Region ($1000s) 


Sales Region    Planned Sales ($1000s)    Actual Sales ($1000s)
Northeast               540                       447
Northwest               420                       447
Southeast               575                       556
Southwest               360                       341


FIGURE 2.25 Side-by-Side Bar Chart for Planned Versus Actual Sales 
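
The comparisons that the side-by-side bar chart supports visually can also be checked numerically. The following Python sketch uses the values from Table 2.16; the variable names and the crude text rendering of the bars are illustrative assumptions and are not part of the Excel workflow used in this text.

```python
# Planned and actual sales ($1000s) by region, taken from Table 2.16.
sales = {
    "Northeast": (540, 447),
    "Northwest": (420, 447),
    "Southeast": (575, 556),
    "Southwest": (360, 341),
}

# Difference between actual and planned sales for each region ($1000s).
difference = {region: actual - planned
              for region, (planned, actual) in sales.items()}

# Regions whose actual sales exceeded the plan.
exceeded_plan = [region for region, diff in difference.items() if diff > 0]

# Region with the lowest actual sales.
lowest_actual = min(sales, key=lambda region: sales[region][1])


def text_bar(value, scale=20):
    """Return a crude text bar: one '#' per `scale` thousand dollars."""
    return "#" * (value // scale)


# Planned and actual bars sit next to each other within a region,
# and a blank line separates the regions, as in a side-by-side bar chart.
for region, (planned, actual) in sales.items():
    print(f"{region:<10} Planned {text_bar(planned)} {planned}")
    print(f"{'':<10} Actual  {text_bar(actual)} {actual}")
    print()
```

Both bars start at zero and use the same scale, which mirrors the guideline that the vertical axis of the chart should begin at zero.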

Creating an effective graphical display is as much art as it is science. By following 
the general guidelines listed below you can increase the likelihood that your display will 
effectively convey the key information in the data. 


• Give the display a clear and concise title. 

• Keep the display simple. Do not use three dimensions when two dimensions are sufficient. 

• Clearly label each axis and provide the units of measure. 

• If color is used to distinguish categories, make sure the colors are distinct. 

• If multiple colors or line types are used, use a legend to define how they are used and 
place the legend close to the representation of the data. 


Choosing the Type of Graphical Display 


In this chapter we discussed a variety of graphical displays, including bar charts, pie charts, 
dot plots, histograms, stem-and-leaf plots, scatter diagrams, side-by-side bar charts, and 
stacked bar charts. Each of these types of displays was developed for a specific purpose. In 
order to provide guidelines for choosing the appropriate type of graphical display, we now 
provide a summary of the types of graphical displays categorized by their purpose. We note 
that some types of graphical displays may be used effectively for multiple purposes. 


Displays Used to Show the Distribution of Data 


• Bar Chart: Used to show the frequency distribution and relative frequency 
distribution for categorical data 

• Pie Chart: Used to show the relative frequency and percent frequency for categorical 
data 

• Dot Plot: Used to show the distribution for quantitative data over the entire range of 
the data 

• Histogram: Used to show the frequency distribution for quantitative data over a set 
of class intervals 

• Stem-and-Leaf Display: Used to show both the rank order and shape of the 
distribution for quantitative data 


Displays Used to Make Comparisons 


• Side-by-Side Bar Chart: Used to compare two variables 
• Stacked Bar Chart: Used to compare the relative frequency or percent frequency of 
two categorical variables 


Displays Used to Show Relationships 


• Scatter Diagram: Used to show the relationship between two quantitative variables 
• Trendline: Used to approximate the relationship of data in a scatter diagram 


Data Dashboards 


One of the most widely used data visualization tools is a data dashboard, also referred 
to as a digital dashboard. If you drive a car, you are already familiar with the concept of 
a data dashboard. In an automobile, the car’s dashboard contains gauges and other visual 
displays that provide the key information that is important when operating the vehicle. 
For example, the gauges used to display the car’s speed, fuel level, engine temperature, 
and oil level are critical to ensure safe and efficient operation of the automobile. In some 
vehicles, this information is even displayed visually on the windshield to provide an even 
more effective display for the driver. Data dashboards play a similar role for managerial 
decision making. 

A data dashboard is a set of visual displays that organizes and presents information that 
is used to monitor the performance of a company or organization in a manner that is easy 
to read, understand, and interpret. Just as a car’s speed, fuel level, engine temperature, and 
oil level are important information to monitor in a car, every business has key performance 
indicators (KPIs), sometimes referred to as key performance metrics (KPMs), that need to 
be monitored to assess how a company is performing. 
Examples of KPIs are inventory on hand, daily sales, percentage of on-time deliveries, and 
sales revenue per quarter. A data dashboard should provide timely summary information 
(potentially from various sources) on KPIs that is important to the user, and it should do so 
in a manner that informs rather than overwhelms its user. 
To illustrate the use of a data dashboard in decision making, we will discuss an ap- 
plication involving the Grogan Oil Company. Grogan has offices located in three cities 
in Texas: Austin (its headquarters), Houston, and Dallas. Grogan’s Information Techno- 
logy (IT) call center, located in the Austin office, handles calls from employees regarding 
computer-related problems involving software, Internet, and e-mail issues. For example, if 
a Grogan employee in Dallas has a computer software problem, the employee can call the 
IT call center for assistance. 
The data dashboard shown in Figure 2.26 was developed to monitor the performance of 
the call center. This data dashboard combines several displays to monitor the call center’s 
KPIs. The data presented are for the current shift, which started at 8:00 a.m. The stacked 
bar chart in the upper left-hand corner shows the call volume for each type of problem 
(software, Internet, or e-mail) over time. This chart shows that call volume is heavier dur- 
ing the first few hours of the shift, calls concerning e-mail issues appear to decrease over 
time, and the volume of calls regarding software issues is highest at midmorning. 

DATA file: Grogan 


FIGURE 2.26 Grogan Oil Information Technology Call Center Data Dashboard 

The bar chart in the upper right-hand corner of the dashboard shows the percentage 
of time that call center employees spent on each type of problem or were idle (not 
working on a call). These top two charts are important displays in determining the 
optimal staffing levels. For instance, knowing the call mix and how stressed the system 
is, as measured by percentage of idle time, can help the IT manager make sure that 
enough call center employees are available with the right level of expertise. 

The side-by-side bar chart titled “Call Volume by Office” shows the call volume by type 
of problem for each of Grogan’s offices. This allows the IT manager to quickly identify 
if there is a particular type of problem by location. For example, it appears that the office 
in Austin is reporting a relatively high number of issues with e-mail. If the source of the 
problem can be identified quickly, then the problem for many might be resolved quickly. 
Also, note that a relatively high number of software problems are coming from the Dallas 
office. The higher call volume in this case was simply due to the fact that the Dallas office 
is currently installing new software, and this has resulted in more calls to the IT call center. 
Because the IT manager was alerted to this by the Dallas office last week, the IT manager 
knew there would be an increase in calls coming from the Dallas office and was able to 
increase staffing levels to handle the expected increase in calls. 

For each unresolved case that was received more than 15 minutes ago, the bar chart 
shown in the middle left-hand side of the data dashboard displays the length of time that 
each of these cases has been unresolved. This chart enables Grogan to quickly monitor the 
key problem cases and decide whether additional resources may be needed to resolve them. 
The worst case, T57, has been unresolved for over 300 minutes and is actually left over 
from the previous shift. Finally, the histogram at the bottom shows the distribution of the 
time to resolve the problem for all resolved cases for the current shift. 
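
The filtering behind the unresolved-cases chart can be sketched in a few lines of Python. This is only an illustration under stated assumptions: apart from case T57, which the text describes as unresolved for over 300 minutes, the case IDs and all of the minute values below are hypothetical and are not taken from the Grogan data file.

```python
# Hypothetical open cases: case ID -> minutes since the call was received.
# Only T57 (unresolved for over 300 minutes) is mentioned in the text;
# the other IDs and every exact value here are made up for illustration.
open_cases = {"T57": 310, "W24": 42, "W5": 8, "T21": 95, "W59": 12}

# The dashboard shows only cases received more than 15 minutes ago.
THRESHOLD_MINUTES = 15

# Keep the cases older than the threshold, longest-unresolved first.
overdue = sorted(
    ((case, minutes) for case, minutes in open_cases.items()
     if minutes > THRESHOLD_MINUTES),
    key=lambda item: item[1],
    reverse=True,
)

for case, minutes in overdue:
    print(f"{case}: unresolved for {minutes} minutes")
```

Sorting the overdue cases from longest to shortest puts the worst case first, which is the information a manager needs when deciding where to direct additional resources.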

The Grogan Oil data dashboard illustrates the use of a dashboard at the operational 
level. The data dashboard is updated in real time and used for operational decisions such 
as staffing levels. Data dashboards may also be used at the tactical and strategic levels of 
management. For example, a logistics manager might monitor KPIs for on-time perform- 
ance and cost for its third-party carriers. This could assist in tactical decisions such as 
transportation mode and carrier selection. At the highest level, a more strategic dashboard 
would allow upper management to quickly assess the financial health of the company by 
monitoring more aggregate financial, service level, and capacity utilization information. 

The guidelines for good data visualization discussed previously apply to the indi- 
vidual charts in a data dashboard, as well as to the entire dashboard. In addition to those 
guidelines, it is important to minimize the need for screen scrolling, avoid unnecessary use 
of color or three-dimensional displays, and use borders between charts to improve readabil- 
ity. As with individual charts, simpler is almost always better. 


Data Visualization in Practice: Cincinnati Zoo 
and Botanical Garden 


The Cincinnati Zoo and Botanical Garden, located in Cincinnati, Ohio, is the second oldest 
zoo in the United States. In order to improve decision making by becoming more data-driven, 
management decided they needed to link together the different facets of their business and 
provide nontechnical managers and executives with an intuitive way to better understand 
their data. A complicating factor is that when the zoo is busy, managers are expected to 
be on the grounds interacting with guests, checking on operations, and anticipating issues 
as they arise or before they become an issue. Therefore, being able to monitor what is 
happening on a real-time basis was a key factor in deciding what to do. Zoo management 
concluded that a data visualization strategy was needed to address the problem. 


The authors are indebted to John Lucas of the Cincinnati Zoo and Botanical Garden for providing this application. 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




FIGURE 2.27 Data Dashboard for the Cincinnati Zoo 
Because of its ease of use, real-time updating capability, and iPad compatibility, the 
Cincinnati Zoo decided to implement its data visualization strategy using IBM’s Cognos 
advanced data visualization software. Using this software, the Cincinnati Zoo developed 
the data dashboard shown in Figure 2.27 to enable zoo management to track the following 
key performance indicators: 


• Item Analysis (sales volumes and sales dollars by location within the zoo) 
• Geoanalytics (using maps and displays of where the day’s visitors are spending their 
time at the zoo) 
• Customer Spending 
• Cashier Sales Performance 
• Sales and Attendance Data Versus Weather Patterns 
• Performance of the Zoo’s Loyalty Rewards Program 


An iPad mobile application was also developed to enable the zoo’s managers to be out 
on the grounds and still see and anticipate what is occurring on a real-time basis. The 
Cincinnati Zoo’s iPad data dashboard, shown in Figure 2.28, provides managers with ac- 
cess to the following information: 


• Real-time attendance data, including what types of guests are coming to the zoo 
• Real-time analysis showing which items are selling the fastest inside the zoo 
• Real-time geographical representation of where the zoo’s visitors live 


Having access to the data shown in Figures 2.27 and 2.28 allows the zoo managers to make 
better decisions on staffing levels within the zoo, which items to stock based upon weather 
and other conditions, and how to better target its advertising based on geodemographics. 
FIGURE 2.28 The Cincinnati Zoo iPad Data Dashboard 

The impact that data visualization has had on the zoo has been substantial. Within 
the first year of use, the system was directly responsible for revenue growth of over 
$500,000, increased visitation to the zoo, enhanced customer service, and reduced 
marketing costs. 


NOTES + COMMENTS 


1. A variety of software is available for data visualization. 
Among the more popular packages are Excel, Power BI, 
JMP, R, SAS Visual Analytics, Spotfire, and Tableau. 

2. A very powerful tool for visualizing geographic data is a 
Geographic Information System (GIS). A GIS uses color, 
symbols, and text on a map to help you understand how 
variables are distributed geographically. For example, a 
company interested in trying to locate a new distribution 
center might wish to better understand how the demand 
for its product varies throughout the United States. A GIS 
can be used to map the demand, with red regions indic- 
ating high demand, blue lower demand, and no color in- 
dicating regions where the product is not sold. Locations 
closer to red high-demand regions might be good candid- 
ate sites for further consideration. 


SUMMARY 


A set of data, even if modest in size, is often difficult to interpret directly in the form in 
which it is gathered. Tabular and graphical displays can be used to summarize and present 
data so that patterns are revealed and the data are more easily interpreted. Frequency dis- 
tributions, relative frequency distributions, percent frequency distributions, bar charts, and 
pie charts were presented as tabular and graphical displays for summarizing the data for a 
single categorical variable. Frequency distributions, relative frequency distributions, percent 
frequency distributions, histograms, cumulative frequency distributions, cumulative relative 
frequency distributions, cumulative percent frequency distributions, and stem-and-leaf dis- 
plays were presented as ways of summarizing the data for a single quantitative variable. 

FIGURE 2.29 Tabular and Graphical Displays for Summarizing Data 

A crosstabulation was presented as a tabular display for summarizing the data for two 
variables, and a scatter diagram was introduced as a graphical display for summarizing the 
data for two quantitative variables. We also showed that side-by-side bar charts and stacked 
bar charts are just extensions of basic bar charts that can be used to display and compare 
two categorical variables. Guidelines for creating effective graphical displays and how to 
choose the most appropriate type of display were discussed. Data dashboards were intro- 
duced to illustrate how a set of visual displays can be developed that organizes and presents 
information that is used to monitor a company’s performance in a manner that is easy to 
read, understand, and interpret. Figure 2.29 provides a summary of the tabular and graph- 
ical methods presented in this chapter. 


GLOSSARY 


Bar chart A graphical device for depicting categorical data that have been summarized in 
a frequency, relative frequency, or percent frequency distribution. 

Categorical data Labels or names used to identify categories of like items. 

Class midpoint The value halfway between the lower and upper class limits. 

Crosstabulation A tabular summary of data for two variables. The classes for one variable 
are represented by the rows; the classes for the other variable are represented by the columns. 

Cumulative frequency distribution A tabular summary of quantitative data showing 
the number of data values that are less than or equal to the upper class limit of each 
class. 

Cumulative percent frequency distribution A tabular summary of quantitative data 
showing the percentage of data values that are less than or equal to the upper class limit of 
each class. 

Cumulative relative frequency distribution A tabular summary of quantitative data 
showing the fraction or proportion of data values that are less than or equal to the upper 
class limit of each class. 

Data dashboard A set of visual displays that organizes and presents information that is 
used to monitor the performance of a company or organization in a manner that is easy to 
read, understand, and interpret. 

Data visualization A term used to describe the use of graphical displays to summarize 
and present information about a data set. 

Dot plot A graphical device that summarizes data by the number of dots above each data 
value on the horizontal axis. 

Frequency distribution A tabular summary of data showing the number (frequency) of 
observations in each of several nonoverlapping categories or classes. 

Histogram A graphical display of a frequency distribution, relative frequency distribution, 
or percent frequency distribution of quantitative data constructed by placing the class inter- 
vals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies 
on the vertical axis. 

Percent frequency distribution A tabular summary of data showing the percentage of 
observations in each of several nonoverlapping classes. 

Pie chart A graphical device for presenting data summaries based on subdivision of a 
circle into sectors that correspond to the relative frequency for each class. 

Quantitative data Numerical values that indicate how much or how many. 

Relative frequency distribution A tabular summary of data showing the fraction or pro- 
portion of observations in each of several nonoverlapping categories or classes. 

Scatter diagram A graphical display of the relationship between two quantitative vari- 
ables. One variable is shown on the horizontal axis and the other variable is shown on the 
vertical axis. 

Side-by-side bar chart A graphical display for depicting multiple bar charts on the same 
display. 

Simpson’s paradox Conclusions drawn from two or more separate crosstabulations that 
can be reversed when the data are aggregated into a single crosstabulation. 

Stacked bar chart A bar chart in which each bar is broken into rectangular segments of 
a different color showing the relative frequency of each class in a manner similar to a pie 
chart. 

Stem-and-leaf display A graphical display used to show simultaneously the rank order 
and shape of a distribution of data. 

Trendline A line that provides an approximation of the relationship between two 
variables. 


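Relative Frequency 

$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n} \tag{2.1}$$

Approximate Class Width 

$$\text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}} \tag{2.2}$$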
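
As a quick check on how the two key formulas are applied, the following Python sketch computes relative frequencies and an approximate class width. The function names and both data sets are hypothetical and serve only to illustrate the arithmetic; the text itself carries out these computations in Excel.

```python
from collections import Counter


def relative_frequencies(values):
    """Relative frequency of each class: frequency of the class divided by n."""
    n = len(values)
    counts = Counter(values)
    return {label: count / n for label, count in counts.items()}


def approximate_class_width(values, number_of_classes):
    """(Largest data value - smallest data value) / number of classes."""
    return (max(values) - min(values)) / number_of_classes


# Hypothetical categorical data (n = 8).
ratings = ["Good", "Good", "Poor", "Excellent",
           "Good", "Poor", "Good", "Excellent"]
rel = relative_frequencies(ratings)

# Hypothetical quantitative data (n = 20), to be grouped into 5 classes.
times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
         22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
width = approximate_class_width(times, 5)

print(rel)
print(width)
```

The relative frequencies always sum to 1. The computed class width is only a starting point; in practice it is usually rounded to a more convenient value before the classes are defined.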
SUPPLEMENTARY EXERCISES 


44. SAT Scores. The SAT is a standardized test used by many colleges and universities 
in admission decisions. More than 1.8 million high school students take the SAT each 
year. The 2019 version of the SAT is composed of two sections: evidence-based read- 
ing and writing, and math. A perfect combined score for the SAT is 1600. A sample of 
SAT scores for the combined SAT is as follows. 


 910    1040     910    1110     900
1200    1160    1020     870    1270
 940    1420     740    1260    1080
1060     920     750    1310    1210
1180    1080    1380    1030     820
 920     600    1040     850     800

a. Show a frequency distribution and histogram. Begin with the first class starting at 
600 and use a class width of 100. 
b. Comment on the shape of the distribution. 
c. What other observations can be made about the SAT scores based on the tabular 
and graphical summaries? 
45. Median Household Incomes. The file MedianHousehold contains the median 
household income for a family with two earners for each of the fifty states (American 
Community Survey, 2013). 
a. Construct a frequency and a percent frequency distribution of median household 
income. Begin the first class at 65.0 and use a class width of 5. 
b. Construct a histogram. 
c. Comment on the shape of the distribution. 
d. Which state has the highest median income for two-earner households? 
e. Which state has the lowest median income for two-earner households? 
46. State Populations. Data showing the population by state as of 2012 in millions of people 
follow (The World Almanac). 
DATA file: Population2012

State          Population    State             Population    State             Population
Alabama           4.8        Louisiana            4.5        Ohio                 MS
Alaska            OW,        Maine                les        Oklahoma             3.8
Arizona           6.4        Maryland             5.8        Oregon               4.3
Arkansas          2.9        Massachusetts        6.5        Pennsylvania        12.7
California        S73        Michigan             9.9        Rhode Island         1.0
Colorado          5.0        Minnesota            5.3        South Carolina       4.6
Connecticut       3.6        Mississippi          3.0        South Dakota         0.8
Delaware          0.9        Missouri             6.0        Tennessee            6.3
Florida          18.8        Montana              0.9        Texas               25.1
Georgia           i          Nebraska             1.8        Utah                 2.8
Hawaii            1.4        Nevada               2          Vermont              0.6
Idaho             1.6        New Hampshire        es         Virginia             8.0
Illinois         12.8        New Jersey           8.8        Washington           6.7
Indiana           6.5        New Mexico           2.0        West Virginia        1.9
Iowa              3.0        New York            19.4        Wisconsin            S
Kansas            2)         North Carolina       25         Wyoming              0.6
Kentucky          4.3        North Dakota         0.7


a. Develop a frequency distribution, a percent frequency distribution, and a histogram. 
Use a class width of 2.5 million. 

b. Does there appear to be any skewness in the distribution? Explain. 

c. What observations can you make about the population of the 50 states? 




47. Startup Company Funds. According to The Wall Street Journal, a startup company’s 
ability to gain funding is a key to success. The funds raised (in millions of dollars) by 
50 startup companies follow. 
DATA file: StartUps

 81     61    103    166    168
 80     51    130     77     78
 69    119     81     60     20
 73     50    110     21     60
192     18     54     49     63
 91    272     58     54     40
 47     24     57     78     78
154     72     38    131     52
 48    118     40     49     55
 54    112    129    156     31

a. Construct a stem-and-leaf display. 
b. Comment on the display. 
48. Complaints Reported to BBB. Consumer complaints are frequently reported to the 
Better Business Bureau (BBB). Some industries against whom the most complaints 
are reported to the BBB are banks, cable and satellite television companies, collection 
agencies, cellular phone providers, and new car dealerships (USA Today). The results 
for a sample of 200 complaints are contained in the file BBB. 
a. Show the frequency and percent frequency of complaints by industry. 
b. Construct a bar chart of the percent frequency distribution. 
c. Which industry had the highest number of complaints? 
d. Comment on the percentage frequency distribution for complaints. 
49. Stock Price Volatility. The term “beta” refers to a measure of a stock’s price volatil- 
ity relative to the stock market as a whole. A beta of 1 means the stock’s price moves 
exactly with the market. A beta of 1.6 means the stock’s price would increase by 1.6% 
for an increase of 1% in the stock market. A larger beta means the stock price is more 
volatile. The beta values for the stocks of the companies that make up the Dow Jones 
Industrial Average are shown in Table 2.17 (Yahoo Finance). 

a. Construct a frequency distribution and percent frequency distribution. 
b. Construct a histogram. 
c. Comment on the shape of the distribution. 
d. Which stock has the highest beta? Which has the lowest beta? 


TABLE 2.17 Betas for Dow Jones Industrial Average Companies 

DATA file: StocksBeta

Company                                        Beta    Company                            Beta
American Express Company                       1.24    3M Company                         1.23
The Boeing Company                             GS)     Merck & Co. Inc.                   .56
Caterpillar Inc.                               12      Microsoft Corporation              .69
Cisco Systems, Inc.                            1236    Nike, Inc.                         .47
Chevron Corporation                            Veil    Pfizer Inc.                        .72
Dow, Inc.                                      1.36    The Procter & Gamble Company       .73
The Walt Disney Company                        E97     AT&T, Inc.                         .18
The Goldman Sachs Group, Inc.                  1.79    The Travelers Companies, Inc.      .86
The Home Depot, Inc.                           1.22    UnitedHealth Group Incorporated    .88
International Business Machines Corporation    .92     United Technologies Corporation    1.22
Intel Corporation                              9       Visa Inc.                          .82
Johnson & Johnson                              .84     Verizon Communications Inc.        .04
JPMorgan Chase & Co.                           1.84    Walgreens Boots Alliance           .81
The Coca-Cola Company                          .68     Walmart Stores Inc.                .26
McDonald's Corp.                               .62     Exxon Mobil Corporation            Ii


50. Education Level and Household Income. The U.S. Census Bureau serves as the 
leading source of quantitative data about the nation’s people and economy. The 
following crosstabulation shows the number of households (1000s) and the house- 
hold income by the level of education for heads of household having received a 
high school degree or more education (U.S. Census Bureau website). 


                                      Household Income

                          Under      $25,000 to    $50,000 to    $100,000
Level of Education        $25,000    $49,999       $99,999       and Over     Total
High School Graduate       9,880       9,970         9,441         3,482     32,773
Bachelor's Degree          2,484       4,164         7,666         7,817     22,131
Master's Degree              685       1,205         3,019         4,094      9,003
Doctoral Degree               79         160           422         1,076      1,737
Total                     13,128      15,499        20,548        16,469     65,644


a. Construct a percent frequency distribution for the level of education variable. What 
percentage of heads of households have a master’s or doctoral degree? 

b. Construct a percent frequency distribution for the household income variable. What 
percentage of households have an income of $50,000 or more? 

c. Convert the entries in the crosstabulation into column percentages. Compare the 
level of education of households with a household income of under $25,000 to the 
level of education of households with a household income of $100,000 or more. 
Comment on any other items of interest when reviewing the crosstabulation show- 
ing column percentages. 

51. Softball Players’ Batting Averages. Western University has only one women’s soft- 
ball scholarship remaining for the coming year. The final two players that Western is 
considering are Allison Fealey and Emily Janson. The coaching staff has concluded 
that the speed and defensive skills are virtually identical for the two players, and 
that the final decision will be based on which player has the best batting average. 
Crosstabulations of each player’s batting performance in their junior and senior 
years of high school are as follows: 


        Allison Fealey                        Emily Janson
Outcome          Junior   Senior      Outcome          Junior   Senior
Hit                  15       75      Hit                  70       35
No Hit               25      175      No Hit              130       85
Total At-Bats        40      250      Total At-Bats       200      120


A player’s batting average is computed by dividing the number of hits a player has by
the total number of at-bats. Batting averages are represented as a decimal number with
three places after the decimal.

a. Calculate the batting average for each player in her junior year. Then calculate the 
batting average of each player in her senior year. Using this analysis, which player 
should be awarded the scholarship? Explain. 
b. Combine or aggregate the data for the junior and senior years into one crosstabulation
as follows:

                       Player
Outcome          Fealey   Janson
Hit
No Hit
Total At-Bats


Calculate each player’s batting average for the combined two years. Using this 
analysis, which player should be awarded the scholarship? Explain. 

c. Are the recommendations you made in parts (a) and (b) consistent? Explain any 
apparent inconsistencies. 


52. Best Places to Work. Fortune magazine publishes an annual survey of the 100 best
companies to work for. The data in the file FortuneBest100 show the rank, company
name, the size of the company, and the percentage job growth for full-time employees
for 98 of the Fortune 100 companies for which percentage job growth data were
available (Fortune magazine website). The column labeled “Rank” shows the rank
of the company in the Fortune 100 list; the column labeled “Size” indicates whether
the company is a small company (less than 2,500 employees), a midsized company
(2,500 to 10,000 employees), or a large company (more than 10,000 employees); and
the column labeled “Job Growth (%)” shows the percentage job growth for full-time
employees.

a. Construct a crosstabulation with Job Growth (%) as the row variable and Size as the
column variable. Use classes starting at −10 and ending at 70 in increments of 10
for Job Growth.

b. Show the frequency distribution for Job Growth (%) and the frequency distribution for 
Size. 

c. Using the crosstabulation constructed in part (a), develop a crosstabulation showing 
column percentages. 

d. Using the crosstabulation constructed in part (a), develop a crosstabulation showing 
row percentages. 

e. Comment on the relationship between the percentage job growth for full-time em- 
ployees and the size of the company. 


53. Colleges’ Year Founded and Cost. Table 2.18 shows a portion of the data for a
sample of 103 private colleges and universities. The complete data set is contained
in the file Colleges. The data include the name of the college or university, the year
the institution was founded, the tuition and fees (not including room and board) for
the most recent academic year, and the percentage of full-time, first-time bachelor’s
degree-seeking undergraduate students who obtain their degree in six years or less
(The World Almanac).

TABLE 2.18 Data for a Sample of Private Colleges and Universities

School                 Year Founded   Tuition & Fees   % Graduate
American University        1893          $36,697          79.00
Baylor University          1845          $29,754          70.00
Belmont University         1951          $23,680          68.00
Wofford College            1854          $31,710          82.00
Xavier University          1831          $29,970          79.00
Yale University            1701          $38,300          98.00

a. Construct a crosstabulation with Year Founded as the row variable and Tuition & 
Fees as the column variable. Use classes starting with 1600 and ending with 2000 
in increments of 50 for Year Founded. For Tuition & Fees, use classes starting with 
1 and ending with 45,000 in increments of 5,000. 

b. Compute the row percentages for the crosstabulation in part (a). 

c. What relationship, if any, do you notice between Year Founded and Tuition & 
Fees? 

54. Colleges’ Year Founded and Percent Graduated. Refer to the data set in Table 2.18.

a. Construct a crosstabulation with Year Founded as the row variable and % Graduate
as the column variable. Use classes starting with 1600 and ending with 2000 in
increments of 50 for Year Founded. For % Graduate, use classes starting with 35%
and ending with 100% in increments of 5%.

b. Compute the row percentages for your crosstabulation in part (a). 

c. Comment on any relationship between the variables. 

55. Colleges’ Year Founded and Cost. Refer to the data set in Table 2.18. 

a. Construct a scatter diagram to show the relationship between Year Founded and 
Tuition & Fees. 

b. Comment on any relationship between the variables. 

56. Colleges’ Cost and Percent Graduated. Refer to the data set in Table 2.18. 

a. Prepare a scatter diagram to show the relationship between Tuition & Fees and 
% Graduate. 

b. Comment on any relationship between the variables. 

57. Electric Vehicle Sales. Electric plug-in vehicle sales have been increasing 
worldwide. The table below displays data collected by the U.S. Department of 
Energy on electric plug-in vehicle sales in the world’s top markets in 2013 and 
2015. (Data compiled by Argonne National Laboratory, U.S. Department of 
Energy website, https://www.energy.gov/eere/vehicles/fact-918-march-28-2016-global-plug-light-vehicle-sales-increased-about-80-2015)

Region               2013       2015
China              15,004    214,283
Western Europe     71,233    184,500
United States      97,102    115,262
Japan              28,716     46,339
Canada                931      5,284

DATA file: ElectricVehicles


a. Construct a side-by-side bar chart with year as the variable on the horizontal axis. 
Comment on any trend in the display. 

b. Convert the above table to percentage allocation for each year. Construct a stacked 
bar chart with year as the variable on the horizontal axis. 

c. Is the display in part (a) or part (b) more insightful? Explain. 

58. Zoo Member Types and Attendance. A zoo has categorized its visitors into three 
categories: member, school, and general. The member category refers to visitors 
who pay an annual fee to support the zoo. Members receive certain benefits such as 
discounts on merchandise and trips planned by the zoo. The school category includes 
faculty and students from day care and elementary and secondary schools; these 
visitors generally receive a discounted rate. The general category includes all other 
visitors. The zoo has been concerned about a recent drop in attendance. To help 
better understand attendance and membership, a zoo staff member has collected the
following data:

                                  Attendance
Visitor Category     Year 1     Year 2     Year 3     Year 4
General             153,703    158,704    163,433    169,106
Member              115,523    104,795     98,437     81,217
School               82,885     79,876     81,970     81,290
Total               352,111    343,375    343,840    331,613

DATA file: Zoo

a. Construct a bar chart of total attendance over time. Comment on any trend in the data. 
b. Construct a side-by-side bar chart showing attendance by visitor category with year 
as the variable on the horizontal axis. 
c. Comment on what is happening to zoo attendance based on the charts from parts (a) 
and (b). 
CASE PROBLEM 1: PELICAN STORES 
Pelican Stores, a division of National Clothing, is a chain of women’s apparel stores operat- 
ing throughout the country. The chain recently ran a promotion in which discount coupons 
were sent to customers of other National Clothing stores. Data collected for a sample of 100 
in-store credit card transactions at Pelican Stores during one day while the promotion was 
running are contained in the file PelicanStores. Table 2.19 shows a portion of the data set. 
The Proprietary Card method of payment refers to charges made using a National Cloth- 
ing charge card. Customers who made a purchase using a discount coupon are referred to 
as promotional customers, and customers who made a purchase but did not use a discount 
coupon are referred to as regular customers. Because the promotional coupons were not 
sent to regular Pelican Stores customers, management considers the sales made to people 
presenting the promotional coupons as sales it would not otherwise make. Of course, 
Pelican also hopes that the promotional customers will continue to shop at its stores. 
Most of the variables shown in Table 2.19 are self-explanatory, but two of the variables 
require some clarification. 
Items The total number of items purchased 
Net Sales The total amount ($) charged to the credit card 
TABLE 2.19 Data for a Sample of 100 Credit Card Purchases at Pelican Stores

           Type of                              Method of                      Marital
Customer   Customer      Items   Net Sales      Payment            Gender     Status     Age
    1      Regular         1        39.50       Discover           Male       Married     32
    2      Promotional     1       102.40       Proprietary Card   Female     Married     36
    3      Regular         1        22.50       Proprietary Card   Female     Married     32
    4      Promotional     5       100.40       Proprietary Card   Female     Married     28
    5      Regular         2        54.00       MasterCard         Female     Married     34
   96      Regular         1        39.50       MasterCard         Female     Married     44
   97      Promotional     9       253.00       Proprietary Card   Female     Married     30
   98      Promotional    10       287.59       Proprietary Card   Female     Married     52
   99      Promotional     2        47.60       Proprietary Card   Female     Married     30
  100      Promotional     1        28.44       Proprietary Card   Female     Married     44

Pelican’s management would like to use this sample data to learn about its customer
base and to evaluate the promotion involving discount coupons. 


Managerial Report 

Use the tabular and graphical methods of descriptive statistics to help management develop 
a customer profile and to evaluate the promotional campaign. At a minimum, your report 
should include the following: 


1. Percent frequency distribution for key variables. 

2. A bar chart or pie chart showing the number of customer purchases attributable to the 
method of payment. 

3. A crosstabulation of type of customer (regular or promotional) versus net sales. Com- 
ment on any similarities or differences present. 

4. A scatter diagram to explore the relationship between net sales and customer age. 


CASE PROBLEM 2: MOVIE THEATER RELEASES


The movie industry is a competitive business. More than 50 studios produce hundreds of 
new movies for theater release each year, and the financial success of each movie varies 
considerably. The opening weekend gross sales ($ millions), the total gross sales ($ mil- 
lions), the number of theaters the movie was shown in, and the number of weeks the movie 
was in release are common variables used to measure the success of a movie released to 
theaters. Data collected for the top 100 theater movies released in 2016 are contained in 
the file Movies2016 (Box Office Mojo website). Table 2.20 shows the data for the first 10 
movies in this file. 


Managerial Report 
Use the tabular and graphical methods of descriptive statistics to learn how these variables 
contribute to the success of a motion picture. Include the following in your report. 


1. Tabular and graphical summaries for each of the four variables along with a discussion
of what each summary tells us about the movies that are released to theaters.

2. A scatter diagram to explore the relationship between Total Gross Sales and Opening
Weekend Gross Sales. Discuss.

3. A scatter diagram to explore the relationship between Total Gross Sales and Number of
Theaters. Discuss.

4. A scatter diagram to explore the relationship between Total Gross Sales and Number of
Weeks in Release. Discuss.


TABLE 2.20 Performance Data for Ten 2016 Movies Released to Theaters

                                      Opening        Total Gross
                                      Gross Sales    Sales          Number of   Weeks in
Movie Title                           ($ millions)   ($ millions)   Theaters    Release
Rogue One: A Star Wars Story             155.08         532.18        4,157        20
Finding Dory                             135.06         486.30        4,305        25
Captain America: Civil War               179.14         408.08        4,226        20
The Secret Life of Pets                  104.35         368.38        4,381        25
The Jungle Book                          103.26         364.00        4,144        24
Deadpool                                 132.43         363.07        3,856        18
Zootopia                                  75.06         341.27        3,959        22
Batman v Superman: Dawn of Justice       166.01         330.36        4,256        12
Suicide Squad                            133.68         325.10        4,255        14
Sing                                      35.26         270.40        4,029        20


CASE PROBLEM 3: QUEEN CITY 




Cincinnati, Ohio, also known as the Queen City, has a population of just over 300,000 and 
is the third largest city in the state of Ohio. The Cincinnati metropolitan area has a popula- 
tion of about 2.2 million. The city is governed by a mayor and a nine-member city council. 
The city manager, who is responsible for the day-to-day operation of the city, reports to the 
mayor and city council. The city manager recently created the Office of Performance and 
Data Analytics with the goal of improving the efficiency of city operations. One of the first 
tasks of this new office is to review the previous year’s expenditures. The file QueenCity 
contains data on the previous year’s expenditures, including the following: 


Department               The number of the department incurring the expenditure
Department Description   The name of the department incurring the expenditure
Category                 The category of the expenditure
Fund                     The fund to which the expenditure was charged
Expenditure              The dollar amount of the expense


Table 2.21 shows the first four entries of the 5427 expenditures for the year. The city
manager would like to use this data to better understand how the city’s budget is being spent.


Managerial Report 


Use tabular and graphical methods of descriptive statistics to help the city manager get a 
better understanding of how the city is spending its funding. Your report should include the 
following: 


1. Tables and/or graphical displays that show the amount of expenditures by category and 
percentage of total expenditures by category. 

2. A table that shows the amount of expenditures by department and the percentage of total 
expenditures by department. Combine any department with less than 1% into a category 
named “Other.” 

3. A table that shows the amount of expenditures by fund and the percentage of total ex- 
penditures by fund. Combine any fund with less than 1% into a category named “Other.” 


TABLE 2.21 Annual Expenditures for Queen City (First Four Entries)

Department   Department Description          Category               Fund                 Expenditure
121          Department of Human Resources   Fringe Benefits        050 - GENERAL FUND   $  7,085.21
121          Department of Human Resources   Fringe Benefits        050 - GENERAL FUND   $102,678.64
121          Department of Human Resources   Fringe Benefits        050 - GENERAL FUND   $ 79,112.85
121          Department of Human Resources   Contractual Services   050 - GENERAL FUND   $  3,572.50


CASE PROBLEM 4: CUT-RATE MACHINING, INC.


Jon Weideman, first shift foreman for Cut-Rate Machining, Inc., is attempting to decide
on a vendor from whom to purchase a drilling machine. He narrows his alternatives to
four vendors: The Hole-Maker, Inc. (HM); Shafts & Slips, Inc. (SS); Judge’s Jigs (JJ); and
Drill-for-Bits, Inc. (DB). Each of these vendors is offering machines of similar capabilities
at similar prices, so the effectiveness of the machines is the only selection criterion that
Mr. Weideman can use. He invites each vendor to ship one machine to his Richmond,
Indiana, manufacturing facility for a test. He starts all four machines at 8:00 A.M. and lets
them warm up for two hours before starting to use any of the machines. After the warmup
period, one of his employees will use each of the shipped machines to drill
3-centimeter-diameter holes in 25-centimeter-thick stainless-steel sheets for two hours.
The widths of holes drilled with each machine are then measured and recorded. The
results of Mr. Weideman’s data collection are shown in Table 2.22.


TABLE 2.22 Data Collected for Drill-For-Bits, Inc. Vendor Selection

                                                        Measured
Shift   Time Period            Employee     Vendor     Width (cm)
1       10:00 a.m.–noon        Ms. Ames     HM         3.50
1       10:00 a.m.–noon        Ms. Ames     HM         3.13
1       10:00 a.m.–noon        Ms. Ames     HM         3.39
1       10:00 a.m.–noon        Ms. Ames     HM         3.08
1       10:00 a.m.–noon        Ms. Ames     HM         3.22
1       10:00 a.m.–noon        Ms. Ames     HM         3.45
1       10:00 a.m.–noon        Ms. Ames     HM         3.32
1       10:00 a.m.–noon        Ms. Ames     HM         3.61
1       10:00 a.m.–noon        Ms. Ames     HM         3.10
1       10:00 a.m.–noon        Ms. Ames     HM         3.03
1       10:00 a.m.–noon        Ms. Ames     HM         3.67
1       10:00 a.m.–noon        Ms. Ames     HM         3.59
1       10:00 a.m.–noon        Ms. Ames     HM         3.38
1       10:00 a.m.–noon        Ms. Ames     HM         3.07
1       10:00 a.m.–noon        Ms. Ames     HM         3.55
1       10:00 a.m.–noon        Ms. Ames     HM         3.00
1       Noon–2:00 p.m.         Ms. Ames     SS         2.48
1       Noon–2:00 p.m.         Ms. Ames     SS         AWE
1       Noon–2:00 p.m.         Ms. Ames     SS         2.99
1       Noon–2:00 p.m.         Ms. Ames     SS         2.68
1       Noon–2:00 p.m.         Ms. Ames     SS         2.75
1       Noon–2:00 p.m.         Ms. Ames     SS         2.42
1       Noon–2:00 p.m.         Ms. Ames     SS         2.92
1       Noon–2:00 p.m.         Ms. Ames     SS         2.68
1       Noon–2:00 p.m.         Ms. Ames     SS         2.98
1       Noon–2:00 p.m.         Ms. Ames     SS         2.50
1       Noon–2:00 p.m.         Ms. Ames     SS         2.45
1       Noon–2:00 p.m.         Ms. Ames     SS         2.99
1       Noon–2:00 p.m.         Ms. Ames     SS         AS
1       Noon–2:00 p.m.         Ms. Ames     SS         2.42
1       Noon–2:00 p.m.         Ms. Ames     SS         AS)
1       Noon–2:00 p.m.         Ms. Ames     SS         2.83
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.66
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.54
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.61
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.57
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.71
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.55
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.52
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.69
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.52
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.57
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.63
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.60
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.58
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.61
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.55
1       2:00 p.m.–4:00 p.m.    Ms. Ames     JJ         2.62
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         4.22
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.68
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.45
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         1.84
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2i
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         3.95
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.46
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         3.79
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         I
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.22
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.42
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.09
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         B38)
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         4.07
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         2.54
2       4:00 p.m.–6:00 p.m.    Mr. Silver   DB         BOG

DATA file: CutRate

Based on these results, from which vendor would you suggest Mr. Weideman purchase his
new machine?


Managerial Report 
Use graphical methods of descriptive statistics to investigate the effectiveness of each ven- 
dor. Include the following in your report: 


1. Scatter plots of the measured width of each hole (cm). 

2. Based on the scatter plots, a discussion of the effectiveness of each vendor and under 
which conditions (if any) that vendor would be acceptable. 

3. A discussion of possible sources of error in the approach taken to assess these vendors. 
APPENDIX


APPENDIX 3.1: DESCRIPTIVE STATISTICS WITH R (MINDTAP READER) 


STATISTICS IN PRACTICE 


Small Fry Design* 
SANTA ANA, CALIFORNIA 


Founded in 1997, Small Fry Design is a toy and acces- 
sory company that designs and imports products for in- 
fants. The company’s product line includes teddy bears, 
mobiles, musical toys, rattles, and security blankets and 
features high-quality soft toy designs with an emphasis 
on color, texture, and sound. The products are designed 
in the United States and manufactured in China. 

Small Fry Design uses independent representatives to 
sell the products to infant furnishing retailers, children’s 
accessory and apparel stores, gift shops, upscale depart- 
ment stores, and major catalog companies. Currently, 
Small Fry Design products are distributed in more than 
1000 retail outlets throughout the United States. 

Cash flow management is one of the most critical 
activities in the day-to-day operation of this company. 
Ensuring sufficient incoming cash to meet both current 
and ongoing debt obligations can mean the difference 
between business success and failure. A critical factor 
in cash flow management is the analysis and control 
of accounts receivable. By measuring the average age 
and dollar value of outstanding invoices, management 
can predict cash availability and monitor changes in the 
status of accounts receivable. The company set the fol- 
lowing goals: The average age for outstanding invoices 
should not exceed 45 days, and the dollar value of 
invoices more than 60 days old should not exceed 5% 
of the dollar value of all accounts receivable. 

In a recent summary of accounts receivable status, 
the following descriptive statistics were provided for the 
age of outstanding invoices: 


Mean 40 days 
Median 35 days 
Mode 31 days 


*The authors are indebted to John A. McCarthy, President of Small Fry 
Design, for providing this Statistics in Practice. 


Interpretation of these statistics shows that the mean or 
average age of an invoice is 40 days. The median shows 
that half of the invoices remain outstanding 35 days or 
more. The mode of 31 days, the most frequent invoice 
age, indicates that the most common length of time an 
invoice is outstanding is 31 days. The statistical sum- 
mary also showed that only 3% of the dollar value of all 
accounts receivable was more than 60 days old. Based 
on the statistical information, management was satisfied 
that accounts receivable and incoming cash flow were 
under control. 

In this chapter, you will learn how to compute and 
interpret some of the statistical measures used by 
Small Fry Design. In addition to the mean, median, 
and mode, you will learn about other descriptive 
statistics such as the range, variance, standard devi- 
ation, percentiles, and correlation. These numerical 
measures will assist in the understanding and interpre- 
tation of data. 


We discuss the process of point estimation in more detail in Chapter 7.

The mean is sometimes referred to as the arithmetic mean.

The sample mean $\bar{x}$ is a sample statistic.

The Greek capital letter sigma, $\Sigma$, is the summation sign.


In Chapter 2 we discussed tabular and graphical presentations used to summarize data. In
this chapter, we present several numerical measures that provide additional alternatives for 
summarizing data. 

We start by developing numerical summary measures for data sets consisting of a single 
variable. When a data set contains more than one variable, the same numerical measures 
can be computed separately for each variable. However, in the two-variable case, we will 
also develop measures of the relationship between the variables. 

Numerical measures of location, dispersion, shape, and association are introduced. If 
the measures are computed for data from a sample, they are called sample statistics. If the 
measures are computed for data from a population, they are called population parameters. 
In statistical inference, a sample statistic is referred to as the point estimator of the corres- 
ponding population parameter. 


3.1 Measures of Location 
Mean 


Perhaps the most important measure of location is the mean, or average value, for a
variable. The mean provides a measure of central location for the data. If the data are for a
sample, the mean is denoted by $\bar{x}$; if the data are for a population, the mean is denoted by
the Greek lowercase letter mu, or $\mu$.

In statistical formulas, it is customary to denote the value of variable $x$ for the first
observation by $x_1$, the value of variable $x$ for the second observation by $x_2$, and so on. In
general, the value of variable $x$ for the $i$th observation is denoted by $x_i$. For a sample with $n$
observations, the formula for the sample mean is as follows.


SAMPLE MEAN

$$\bar{x} = \frac{\sum x_i}{n} \tag{3.1}$$

In the preceding formula, the numerator is the sum of the values of the $n$ observations. That
is,

$$\sum x_i = x_1 + x_2 + \cdots + x_n$$

To illustrate the computation of a sample mean, let us consider the following class size 
data for a sample of five college classes. 
46 54 42 46 32 
We use the notation $x_1, x_2, x_3, x_4, x_5$ to represent the number of students in each of the
five classes.

$$x_1 = 46 \quad x_2 = 54 \quad x_3 = 42 \quad x_4 = 46 \quad x_5 = 32$$

Hence, to compute the sample mean, we can write

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{46 + 54 + 42 + 46 + 32}{5} = 44$$


The sample mean class size is 44 students. 
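
The computation in equation (3.1) can also be sketched in a few lines of Python. This is only an illustration of the formula, not part of the Excel procedures used in this text; the function name `sample_mean` is our own choice.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n


# Class size data for the sample of five college classes
class_sizes = [46, 54, 42, 46, 32]
print(sample_mean(class_sizes))  # 220 / 5 = 44.0
```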

To provide a visual perspective of the mean and to show how it can be influenced by
extreme values, consider the dot plot for the class size data shown in Figure 3.1. Treating the
horizontal axis used to create the dot plot as a long, narrow board in which each of the dots
has the same fixed weight, the mean is the point at which we would place a fulcrum or pivot
point under the board in order to balance the dot plot. This is the same principle by which a
see-saw on a playground works, the only difference being that the see-saw is pivoted in the
middle so that as one end goes up, the other end goes down. In the dot plot we are locating
the pivot point based upon the location of the dots.

FIGURE 3.1 The Mean as the Center of Balance for the Dot Plot of the Classroom Size Data
(dot plot; horizontal axis from 30 to 55)

Now consider what happens to the balance if we increase the largest value from 54 to 114.
We will have to move the fulcrum under the new dot plot in a positive direction in order to
reestablish balance. To determine how far we would have to shift the fulcrum, we simply
compute the sample mean for the revised class size data.


$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{46 + 114 + 42 + 46 + 32}{5} = \frac{280}{5} = 56$$


Thus, the mean for the revised class size data is 56, an increase of 12 students. In other words, 
we have to shift the balance point 12 units to the right to establish balance under the new dot 
plot. 
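
The balance-point interpretation can be checked numerically: the deviations of the observations from the mean always sum to zero, which is what it means for the fulcrum to balance the dot plot. The short Python sketch below is an illustration only (the helper names are our own) and verifies this for both the original and the revised class size data.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def deviations_from_mean(values):
    """Signed distance of each observation from the balance point (the mean)."""
    center = sample_mean(values)
    return [x - center for x in values]


original = [46, 54, 42, 46, 32]
revised = [46, 114, 42, 46, 32]   # largest value increased from 54 to 114

for data in (original, revised):
    # The deviations to the left and right of the mean cancel out.
    print(sample_mean(data), sum(deviations_from_mean(data)))

# Distance the fulcrum must move to the right to restore balance
shift = sample_mean(revised) - sample_mean(original)
print(shift)
```

Because the deviations must cancel, a single value moved far to the right pulls the balance point with it, which is why the mean is sensitive to extreme values.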

Another illustration of the computation of a sample mean is given in the following situ- 
ation. Suppose that a college placement office sent a questionnaire to a sample of business 
school graduates requesting information on monthly starting salaries. Table 3.1 shows the 
collected data. The mean monthly starting salary for the sample of 12 business college 
graduates is computed as 


$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{5{,}850 + 5{,}950 + \cdots + 5{,}880}{12} = \frac{71{,}280}{12} = 5{,}940$$

TABLE 3.1 Monthly Starting Salaries for a Sample of 12 Business School Graduates

             Monthly Starting                Monthly Starting
Graduate     Salary ($)         Graduate     Salary ($)
    1           5850                7           5890
    2           5950                8           6130
    3           6050                9           5940
    4           5880               10           6325
    5           5755               11           5920
    6           5710               12           5880

Equation (3.1) shows how the mean is computed for a sample with $n$ observations. The
formula for computing the mean of a population remains the same, but we use different
notation to indicate that we are working with the entire population. The number of
observations in a population is denoted by $N$ and the symbol for a population mean is $\mu$.


POPULATION MEAN

$$\mu = \frac{\sum x_i}{N} \tag{3.2}$$

The sample mean $\bar{x}$ is a point estimator of the population mean $\mu$.

Median 


The median is another measure of central location. The median is the value in the middle 
when the data are arranged in ascending order (smallest value to largest value). With an 
odd number of observations, the median is the middle value. An even number of observa- 
tions has no single middle value. In this case, we follow convention and define the median 
as the average of the values for the middle two observations. For convenience the definition 
of the median is restated as follows. 


MEDIAN 


Arrange the data in ascending order (smallest value to largest value). 


(a) For an odd number of observations, the median is the middle value. 
(b) For an even number of observations, the median is the average of the two 
middle values. 


Let us apply this definition to compute the median class size for the sample of five college 
classes. Arranging the data in ascending order provides the following list. 


32 42 46 46 54 


Because n = 5 is odd, the median is the middle value. Thus the median class size is 
46 students. Even though this data set contains two observations with values of 46, each 
observation is treated separately when we arrange the data in ascending order. 

Suppose we also compute the median starting salary for the 12 business college 
graduates in Table 3.1. We first arrange the data in ascending order. 


5710 5755 5850 5880 5880 5890 5920 5940 5950 6050 6130 6325


Because n = 12 is even, we identify the middle two values: 5890 and 5920. The median is 
the average of these values. 


$$\text{Median} = \frac{5890 + 5920}{2} = 5905$$
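
The odd/even rule in the definition of the median translates directly into code. The following Python sketch is an illustration only (the function name `median` is our own) and applies the rule to the class size data and to the starting salaries as they appear in the ordered list above.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # (a) odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # (b) even n: average of the middle two


class_sizes = [46, 54, 42, 46, 32]
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

print(median(class_sizes))  # n = 5 is odd
print(median(salaries))     # n = 12 is even: (5890 + 5920) / 2
```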


The procedure we used to compute the median depends upon whether there is an odd 
number of observations or an even number of observations. Let us now describe a more 
conceptual and visual approach using the monthly starting salary for the 12 business col- 
lege graduates. As before, we begin by arranging the data in ascending order. 


5710 5755 5850 5880 5880 5890 5920 5940 5950 6050 6130 6325 


Once the data are in ascending order, we trim pairs of extreme high and low values until
no further pairs of values can be trimmed without completely eliminating all the data. For
instance, after trimming the lowest observation (5710) and the highest observation (6325)
we obtain a new data set with 10 observations.


5755 5850 5880 5880 5890 5920 5940 5950 6050 6130


We then trim the next lowest remaining value (5755) and the next highest remaining value
(6130) to produce a new data set with eight observations.


5850 5880 5880 5890 5920 5940 5950 6050


Continuing this process we obtain the following results.


5880 5880 5890 5920 5940 5950

5880 5890 5920 5940

5890 5920

At this point no further trimming is possible without eliminating all the data. So, the 
median is just the average of the remaining two values. When there is an even number 
of observations, the trimming process will always result in two remaining values, and 
the average of these values will be the median. When there is an odd number of observa- 
tions, the trimming process will always result in one final value, and this value will be the 
median. Thus, this method works whether the number of observations is odd or even. 
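
The trimming procedure can be sketched as a short loop. The Python code below is a minimal illustration under our own naming (`median_by_trimming`); it removes one lowest and one highest value at a time and stops when one or two values remain.

```python
def median_by_trimming(values):
    """Trim the lowest and highest remaining values until one or two values are left."""
    remaining = sorted(values)
    if not remaining:
        raise ValueError("the median is undefined for an empty data set")
    while len(remaining) > 2:
        remaining = remaining[1:-1]   # drop one extreme value from each end
    # One value left (odd n) or two values left (even n): average what remains.
    return sum(remaining) / len(remaining)


salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]
class_sizes = [46, 54, 42, 46, 32]

print(median_by_trimming(salaries))     # 5890 and 5920 remain
print(median_by_trimming(class_sizes))  # 46 remains
```

Sorting and indexing the middle position directly is more efficient; the loop is written this way only to mirror the conceptual description.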
Although the mean is the more commonly used measure of central location, in some
situations the median is preferred. The mean is influenced by extremely small and large
data values. For instance, suppose that the highest paid graduate (see Table 3.1) had a
starting salary of $15,000 per month (maybe the individual’s family owns the company). If
we change the highest monthly starting salary in Table 3.1 from $6325 to $15,000 and
recompute the mean, the sample mean changes from $5940 to $6663. The median of $5905,
however, is unchanged, because $5890 and $5920 are still the middle two values. With the
extremely high starting salary included, the median provides a better measure of central
location than the mean. We can generalize to say that whenever a data set contains extreme
values, the median is often the preferred measure of central location.

The median is the measure of location most often reported for annual income and property
value data because a few extremely large incomes or property values can inflate the mean.
In such cases, the median is the preferred measure of central location.
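
The effect of the $15,000 starting salary described above can be checked numerically. The following is a minimal Python sketch using the standard `statistics` module; the variable names are illustrative choices, not part of the text's Excel files.

```python
from statistics import mean, median

salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

# Replace the highest salary (6325) with the extreme value 15,000.
with_outlier = salaries[:-1] + [15000]

print(mean(salaries), median(salaries))                  # original mean and median
print(round(mean(with_outlier)), median(with_outlier))   # mean shifts, median does not
```

The mean moves by several hundred dollars while the median stays at the average of the two middle values, which is the behavior the paragraph describes.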

Mode 


Another measure of location is the mode. The mode is defined as follows. 


MODE 
The mode is the value that occurs with greatest frequency. 


To illustrate the identification of the mode, consider the sample of five class sizes. The 
only value that occurs more than once is 46. Because this value, occurring with a fre- 
quency of 2, has the greatest frequency, it is the mode. As another illustration, consider the 
sample of starting salaries for the business school graduates. The only monthly starting 
salary that occurs more than once is $5880. Because this value has the greatest frequency, 
it is the mode. 

Situations can arise for which the greatest frequency occurs at two or more different 
values. In these instances more than one mode exists. If the data contain exactly two 
modes, we say that the data are bimodal. If data contain more than two modes, we say 
that the data are multimodal. In multimodal cases the mode is almost never reported 
because listing three or more modes would not be particularly helpful in describing a 
location for the data. 
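
As a rough illustration of these definitions, the sketch below uses Python's `statistics.multimode`, which returns every value tied for the greatest frequency. The helper `describe_modes` and the small bimodal list are hypothetical examples added here; note that if every value occurs exactly once, this sketch reports all of them as modes.

```python
from statistics import multimode


def describe_modes(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # all values tied for the greatest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]
print(describe_modes(salaries))           # only 5880 occurs more than once

# Hypothetical data with two values tied for the greatest frequency.
print(describe_modes([1, 1, 2, 2, 3]))
```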
Using Excel to Compute the Mean, Median, and Mode


Excel provides functions for computing the mean, median, and mode. We illustrate the use
of these functions by computing the mean, median, and mode for the starting salary data in
Table 3.1. Refer to Figure 3.2 as we describe the tasks involved. The formula worksheet is
in the background; the value worksheet is in the foreground.

DATA file: StartingSalaries


Enter/Access Data: Open the file StartingSalaries. The data are in cells B2:B13 and labels 
are in column A and cell B1. 


Enter Functions and Formulas: Excel’s AVERAGE function can be used to compute the 
mean by entering the following formula in cell E2: 


=AVERAGE(B2:B13) 


Similarly, the formulas =MEDIAN(B2:B13) and =MODE.SNGL(B2:B13) are entered in 
cells E3 and E4, respectively, to compute the median and the mode. 

The formulas in cells E2:E4 are displayed in the background worksheet of Figure 3.2
and the values computed using the Excel functions are displayed in the foreground
worksheet. Labels were also entered in cells D2:D4 to identify the output. Note that the mean
(5940), median (5905), and mode (5880) are the same as we computed earlier.


Weighted Mean 


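In the formulas for the sample mean and population mean, each \(x_i\) is given equal
importance or weight. For instance, the formula for the sample mean can be written
as follows:

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{1}{n}\left(\sum x_i\right) = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}(x_1) + \frac{1}{n}(x_2) + \cdots + \frac{1}{n}(x_n) \]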


This shows that each observation in the sample is given a weight of 1/n.

FIGURE 3.2 | Excel Worksheet Used to Compute the Mean, Median, and Mode for the Starting Salary Data

Although this practice is most common, in some instances the mean is computed by giving each
observation a weight that reflects its relative importance. A mean computed in this
manner is referred to as a weighted mean. The weighted mean is computed as follows:


WEIGHTED MEAN

\[ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \qquad (3.3) \]

where

\(w_i\) = weight for observation i


When the data are from a sample, equation (3.3) provides the weighted sample mean. If the
data are from a population, \(\mu\) replaces \(\bar{x}\) and equation (3.3) provides the weighted
population mean.

As an example of the need for a weighted mean, consider the following sample of five 
purchases of a raw material over the past three months. 


Purchase    Cost per Pound ($)    Number of Pounds
1           3.00                  1200
2           3.40                   500
3           2.80                  2750
4           2.90                  1000
5           3.25                   800


Note that the cost per pound varies from $2.80 to $3.40, and the quantity purchased varies
from 500 to 2750 pounds. Suppose that a manager wanted to know the mean cost per
pound of the raw material. Because the quantities ordered vary, we must use the formula
for a weighted mean. The five cost-per-pound data values are x_1 = 3.00, x_2 = 3.40,
x_3 = 2.80, x_4 = 2.90, and x_5 = 3.25. The weighted mean cost per pound is found by weighting
each cost by its corresponding quantity. For this example, the weights are w_1 = 1200,
w_2 = 500, w_3 = 2750, w_4 = 1000, and w_5 = 800. Based on equation (3.3), the weighted
mean is calculated as follows:


\[ \bar{x} = \frac{1200(3.00) + 500(3.40) + 2750(2.80) + 1000(2.90) + 800(3.25)}{1200 + 500 + 2750 + 1000 + 800} = \frac{18{,}500}{6250} = 2.96 \]


Thus, the weighted mean computation shows that the mean cost per pound for the raw 
material is $2.96. Note that using equation (3.1) rather than the weighted mean formula in 
equation (3.3) would provide misleading results. In this case, the sample mean of the five 
cost-per-pound values is (3.00 + 3.40 + 2.80 + 2.90 + 3.25)/5 = 15.35/5 = $3.07, which 
overstates the actual mean cost per pound purchased. 
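
The comparison above can be reproduced with a short Python sketch of equation (3.3). The function name `weighted_mean` is an illustrative helper introduced here, not an Excel or library function; the data are the five purchases from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


cost_per_pound = [3.00, 3.40, 2.80, 2.90, 3.25]
pounds = [1200, 500, 2750, 1000, 800]

weighted = weighted_mean(cost_per_pound, pounds)
unweighted = sum(cost_per_pound) / len(cost_per_pound)

print(round(weighted, 2))    # weighted mean cost per pound
print(round(unweighted, 2))  # simple mean of the five prices
```

The unweighted figure is higher because it gives the small 500-pound purchase at $3.40 the same influence as the 2750-pound purchase at $2.80.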

The choice of weights for a particular weighted mean computation depends upon the 
application. An example that is well known to college students is the computation of a 
grade point average (GPA). In this computation, the data values generally used are 4 for 
an A grade, 3 for a B grade, 2 for a C grade, 1 for a D grade, and 0 for an F grade. The 
weights are the number of credit hours earned for each grade. Exercise 16 at the end of this 
section provides an example of this weighted mean computation. In other weighted mean 
computations, quantities such as pounds, dollars, or volume are frequently used as weights. 
In any case, when observations vary in importance, the analyst must choose the weight that 
best reflects the importance of each observation in the determination of the mean. 
DATA file: MutualFund

The growth factor for each year is 1 plus .01 times the percentage return. A growth factor
less than 1 indicates negative growth, while a growth factor greater than 1 indicates
positive growth. The growth factor cannot be less than zero.
TABLE 3.2 Percentage Annual Returns and Growth Factors for the Mutual Fund Data

Year    Return (%)    Growth Factor
1       -22.1         0.779
2        28.7         1.287
3        10.9         1.109
4         4.9         1.049
5        15.8         1.158
6         5.5         1.055
7       -37.0         0.630
8        26.5         1.265
9        15.1         1.151
10        2.1         1.021


Geometric Mean 


The geometric mean is a measure of location that is calculated by finding the nth root of
the product of n values. The general formula for the geometric mean, denoted \(\bar{x}_g\), follows.


GEOMETRIC MEAN

\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \qquad (3.4) \]


The geometric mean is often used in analyzing growth rates in financial data. In these types 
of situations the arithmetic mean or average value will provide misleading results. 

To illustrate the use of the geometric mean, consider Table 3.2, which shows the per- 
centage annual returns, or growth rates, for a mutual fund over the past 10 years. Suppose 
we want to compute how much $100 invested in the fund at the beginning of year 1 would 
be worth at the end of year 10. Let’s start by computing the balance in the fund at the end 
of year 1. Because the percentage annual return for year 1 was —22.1%, the balance in the 
fund at the end of year 1 would be 


$100 — .221($100) = $100(1 — .221) = $100(.779) = $77.90 


Note that .779 is identified as the growth factor for year 1 in Table 3.2. This result shows 
that we can compute the balance at the end of year 1 by multiplying the value invested in 
the fund at the beginning of year 1 times the growth factor for year 1.

The balance in the fund at the end of year 1, $77.90, now becomes the beginning bal- 
ance in year 2. So, with a percentage annual return for year 2 of 28.7%, the balance at the 
end of year 2 would be 


$77.90 + .287($77.90) = $77.90(1 + .287) = $77.90(1.287) = $100.2573 


Note that 1.287 is the growth factor for year 2. And, by substituting $100(.779) for $77.90, 
we see that the balance in the fund at the end of year 2 is 


$100(.779) (1.287) = $100.2573 


In other words, the balance at the end of year 2 is just the initial investment at the begin- 
ning of year 1 times the product of the first two growth factors. This result can be gen- 
eralized to show that the balance at the end of year 10 is the initial investment times the 
product of all 10 growth factors. 


$100[(.779)(1.287)(1.109)(1.049)(1.158)(1.055)(.630)(1.265)(1.151)(1.021)] = 
$100(1.334493) = $133.4493 
So, a $100 investment in the fund at the beginning of year 1 would be worth $133.4493
at the end of year 10. Note that the product of the 10 growth factors is 1.334493. Thus, 
we can compute the balance at the end of year 10 for any amount of money invested at 
the beginning of year 1 by multiplying the value of the initial investment times 1.334493. 
For instance, an initial investment of $2500 at the beginning of year 1 would be worth 
$2500(1.334493) or approximately $3336 at the end of year 10. 
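
The compounding argument above amounts to multiplying the 10 growth factors. A minimal Python sketch follows, assuming the percentage returns from Table 3.2; the variable names are illustrative only.

```python
from math import prod

# Percentage annual returns for years 1 through 10 (Table 3.2).
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

# Growth factor for each year: 1 plus .01 times the percentage return.
growth_factors = [1 + 0.01 * r for r in returns_pct]

# The ending balance is the initial investment times the product of all factors.
total_growth = prod(growth_factors)

print(round(total_growth, 6))
print(round(100 * total_growth, 2))   # value of $100 after 10 years
print(round(2500 * total_growth))     # value of $2500 after 10 years
```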

What was the mean percentage annual return or mean rate of growth for this investment
over the 10-year period? The geometric mean of the 10 growth factors can be used
to answer this question. Because the product of the 10 growth factors is 1.334493, the
geometric mean is the 10th root of 1.334493 or

\[ \bar{x}_g = \sqrt[10]{1.334493} = 1.029275 \]


The nth root can be computed using most calculators or by using the POWER function in
Excel. For instance, using Excel, the 10th root of 1.334493 is =POWER(1.334493,1/10),
or 1.029275.


The geometric mean tells us that annual returns grew at an average annual rate of 
(1.029275 — 1)100% or 2.9275%. In other words, with an average annual growth rate 
of 2.9275%, a $100 investment in the fund at the beginning of year 1 would grow to 
$100(1.029275)^10 = $133.4493 at the end of 10 years.

It is important to understand that the arithmetic mean of the percentage annual returns 
does not provide the mean annual growth rate for this investment. The sum of the 10 an- 
nual percentage returns in Table 3.2 is 50.4. Thus, the arithmetic mean of the 
10 percentage annual returns is 50.4/10 = 5.04%. A broker might try to convince you to 
invest in this fund by stating that the mean annual percentage return was 5.04%. Such 
a statement is not only misleading, it is inaccurate. A mean annual percentage return of 
5.04% corresponds to an average growth factor of 1.0504. So, if the average growth factor 
were really 1.0504, $100 invested in the fund at the beginning of year 1 would have grown 
to $100(1.0504)^10 = $163.51 at the end of 10 years. But, using the 10 annual percentage
returns in Table 3.2, we showed that an initial $100 investment is worth $133.45 at the end 
of 10 years. The broker’s claim that the mean annual percentage return is 5.04% grossly 
overstates the true growth for this mutual fund. The problem is that the sample mean is 
only appropriate for an additive process. For a multiplicative process, such as applications 
involving growth rates, the geometric mean is the appropriate measure of location. 
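
The gap between the two averages can be illustrated with Python's `statistics.geometric_mean` (available in Python 3.8 and later). This is a sketch for comparison only; the variable names are assumptions made for the example.

```python
from statistics import geometric_mean, mean

# Growth factors for years 1 through 10 (Table 3.2).
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]

geo = geometric_mean(growth_factors)   # appropriate for a multiplicative process
arith = mean(growth_factors)           # corresponds to the 5.04% arithmetic mean return

print(round(geo, 6), round(arith, 4))
print(round(100 * geo ** 10, 2))    # matches the actual ending balance
print(round(100 * arith ** 10, 2))  # balance implied by the arithmetic mean
```

Compounding at the geometric mean reproduces the actual 10-year balance, whereas compounding at the arithmetic mean implies a noticeably larger balance than the fund actually delivered.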

While the applications of the geometric mean to problems in finance, investments, and 
banking are particularly common, the geometric mean should be applied any time you want 
to determine the mean rate of change over several successive periods. Other common applic- 
ations include changes in populations of species, crop yields, pollution levels, and birth and 
death rates. Also note that the geometric mean can be applied to changes that occur over any 
number of successive periods of any length. In addition to annual changes, the geometric mean 
is often applied to find the mean rate of change over quarters, months, weeks, and even days. 


Using Excel to Compute the Geometric Mean 


Excel’s GEOMEAN function can be used to compute the geometric mean for the mutual
fund data in Table 3.2. Refer to Figure 3.3 as we describe the tasks involved. The formula
worksheet is in the background; the value worksheet is in the foreground.

DATA file: MutualFund


Enter/Access Data: Open the file MutualFund. The data are in cells B2:B11 and labels are
in column A and cell B1.


Enter Functions and Formulas: To compute the growth factor for the percentage return 
in cell B2 (—22.1) we entered the following formula in cell C2: 
=1+.01*B2 


To compute the growth factors for the other percentage returns we copied the formula
in cell C2 to cells C3:C11. Excel’s GEOMEAN function can now be used to compute the geometric
mean for the growth factors in cells C2:C11 by entering the following formula in cell F2:


=GEOMEAN(C2:C11) 
FIGURE 3.3 | Using Excel to Compute the Geometric Mean for the Mutual Fund Data


The labels “Growth Factor” and “Geometric Mean” were entered in cells C1 and E2, 
respectively, to identify the output. Note that the geometric mean (1.029275) is the same 
value as we computed earlier. 


Percentiles 


A percentile provides information about how the data are spread over the interval from 
the smallest value to the largest value. For a data set containing n observations, the pth 
percentile divides the data into two parts: Approximately p% of the observations are less 
than the pth percentile, and approximately (100 — p)% of the observations are greater than 
the pth percentile. 

Colleges and universities frequently report admission test scores in terms of per- 
centiles. For instance, suppose an applicant obtains a score of 630 on the math portion 
of an admissions test. How this applicant performed in relation to others taking the 
same test may not be readily apparent. However, if the score of 630 corresponds to 
the 82nd percentile, we know that approximately 82% of the applicants scored
lower than this individual and approximately 18% of the applicants scored higher than 
this individual. 

To calculate the pth percentile for a data set containing n observations, we must first
arrange the data in ascending order (smallest value to largest value). The smallest value
is in position 1, the next smallest value is in position 2, and so on. The location of the pth
percentile, denoted \(L_p\), is computed using the following equation.


LOCATION OF THE PTH PERCENTILE

\[ L_p = \frac{p}{100}(n + 1) \qquad (3.5) \]


Several procedures can be used to compute the location of the pth percentile using sample
data. All provide similar values, especially for large data sets. The procedure we show here
is the procedure used by Excel's PERCENTILE.EXC function as well as several other
statistical software packages.

To illustrate the computation of the pth percentile, let us compute the 80th percentile
for the starting salary data in Table 3.1. We begin by arranging the sample of 12 starting
salaries in ascending order.


Value       5710  5755  5850  5880  5880  5890  5920  5940  5950  6050  6130  6325
Position       1     2     3     4     5     6     7     8     9    10    11    12
The position of each observation in the sorted data is shown directly below its value. For
instance, the smallest value (5710) is in position 1, the next smallest value (5755) is in
position 2, and so on. Using equation (3.5) with p = 80 and n = 12, the location of the
80th percentile is

\[ L_{80} = \frac{p}{100}(n + 1) = \left(\frac{80}{100}\right)(12 + 1) = 10.4 \]

The interpretation of \(L_{80} = 10.4\) is that the 80th percentile is 40% of the way between the
value in position 10 and the value in position 11. In other words, the 80th percentile is the
value in position 10 (6050) plus 0.4 times the difference between the value in position 11
(6130) and the value in position 10 (6050). Thus,


80th percentile = 6050 + .4(6130 — 6050) = 6050 + .4(80) = 6082 


Let us now compute the 50th percentile for the starting salary data. With p = 50 and
n = 12, the location of the 50th percentile is

\[ L_{50} = \frac{p}{100}(n + 1) = \left(\frac{50}{100}\right)(12 + 1) = 6.5 \]

With \(L_{50} = 6.5\), we see that the 50th percentile is 50% of the way between the value in
position 6 (5890) and the value in position 7 (5920). Thus,


50th percentile = 5890 + .5(5920 — 5890) = 5890 + .5(30) = 5905 


Note that the 50th percentile is also the median. 
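
The two calculations above follow the same recipe: compute the location from equation (3.5), then interpolate between the neighboring sorted values. The Python sketch below is one way to express that recipe; `percentile_by_location` is a hypothetical helper written for this illustration, and raising an error for locations outside positions 1 through n is a choice made here.

```python
import math


def percentile_by_location(values, p):
    """pth percentile using the location L_p = (p/100)(n + 1) and interpolation."""
    data = sorted(values)
    n = len(data)
    location = p / 100 * (n + 1)
    if not 1 <= location <= n:
        raise ValueError("location falls outside positions 1 through n")
    lower = math.floor(location)   # position of the value just below L_p
    fraction = location - lower    # how far to move toward the next position
    if fraction == 0:
        return data[lower - 1]
    # Positions are 1-based in the text, so position k is data[k - 1].
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]
print(percentile_by_location(salaries, 80))
print(percentile_by_location(salaries, 50))
```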


Quartiles 


It is often desirable to divide a data set into four parts, with each part containing
approximately one-fourth, or 25%, of the observations. These division points are referred to as the
quartiles and are defined as follows:

Q_1 = first quartile, or 25th percentile
Q_2 = second quartile, or 50th percentile (also the median)
Q_3 = third quartile, or 75th percentile

Quartiles are specific percentiles; thus, the steps for computing percentiles can be applied
directly in the computation of quartiles.


Because quartiles are just specific percentiles, the procedure for computing percentiles can 
be used to compute the quartiles. 

To illustrate the computation of the quartiles for a data set consisting of n observa- 
tions, we will compute the quartiles for the starting salary data in Table 3.1. Previously we 
showed that the 50th percentile for the starting salary data is 5905; thus, the second quartile 
(median) is Q_2 = 5905. To compute the first and third quartiles we must find the 25th and
75th percentiles. The calculations follow. 


For Q_1,

\[ L_{25} = \frac{p}{100}(n + 1) = \left(\frac{25}{100}\right)(12 + 1) = 3.25 \]


The first quartile, or 25th percentile, is .25 of the way between the value in position 3 
(5850) and the value in position 4 (5880). Thus, 


Q_1 = 5850 + .25(5880 - 5850) = 5850 + .25(30) = 5857.5

For Q_3,

\[ L_{75} = \frac{p}{100}(n + 1) = \left(\frac{75}{100}\right)(12 + 1) = 9.75 \]
The third quartile, or 75th percentile, is .75 of the way between the value in position 9
(5950) and the value in position 10 (6050). Thus,


Q_3 = 5950 + .75(6050 - 5950) = 5950 + .75(100) = 6025


We defined the quartiles as the 25th, 50th, and 75th percentiles. Thus, we computed the 
quartiles in the same way as percentiles. However, other conventions are sometimes used to 
compute quartiles, and the actual values reported for quartiles may vary slightly depending 
on the convention used. Nevertheless, the objective of all procedures for computing quart- 
iles is to divide the data into four equal parts. 
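
To see how the convention affects the reported quartiles, the sketch below uses Python's `statistics.quantiles`. For this data its `method="exclusive"` option reproduces the values computed above, while `method="inclusive"` is shown only as an example of a different convention; this is an illustration in Python, not a description of how Excel computes quartiles.

```python
from statistics import quantiles

salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

# n=4 requests the three cut points that divide the data into four parts.
exclusive = quantiles(salaries, n=4, method="exclusive")
inclusive = quantiles(salaries, n=4, method="inclusive")

print(exclusive)  # agrees with the (p/100)(n + 1) location rule for this data
print(inclusive)  # a different convention: Q1 and Q3 shift, the median does not
```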


Using Excel to Compute Percentiles and Quartiles 


Excel provides functions for computing percentiles and quartiles. We will illustrate the
use of these functions by showing how to compute the pth percentile and the quartiles
for the starting salary data in Table 3.1. Refer to Figure 3.4 as we describe the tasks
involved. The formula worksheet is in the background; the value worksheet is in the
foreground.

DATA file: StartingSalaries


Enter/Access Data: Open the file StartingSalaries. The data are in cells B2:B13 and labels 
are in column A and cell B1. 


Enter Functions and Formulas: Excel’s PERCENTILE.EXC function can be used
to compute the pth percentile. For the starting salary data the general form of this 
function is 


=PERCENTILE.EXC(B2:B13,p/100) 


If we wanted to compute the 80th percentile for the starting salary data we could enter the 
value of 80 into cell D2 and then enter the formula 


=PERCENTILE.EXC(B2:B13,D2/100) 


in cell E2. 

FIGURE 3.4 | Using Excel to Compute Percentiles and Quartiles

Because the quartiles are just the 25th, 50th, and 75th percentiles, we could compute
the quartiles for the starting salary data by using Excel’s PERCENTILE.EXC function as
described above. But we can also use Excel’s QUARTILE.EXC function to compute the
quartiles. For the starting salary data, the general form of this function is


=QUARTILE.EXC(B2:B13,quart)


where quart = 1 for the first quartile, quart = 2 for the second quartile, and quart = 3
for the third quartile. To illustrate the use of this function for computing the quartiles we
entered the values 1, 2, and 3 in cells D5:D7 of the worksheet. To compute the first quartile
we entered the following function in cell E5:


=QUARTILE.EXC($B$2:$B$13,D5)


To compute the second and third quartiles we copied the formula in cell E5 to cells E6 and
E7. Labels were entered in cells D4 and E4 to identify the output. Note that the three
quartiles (5857.5, 5905, and 6025) are the same values as computed previously.


NOTES + COMMENTS


1. It is better to use the median than the mean as a measure of central location when a
data set contains extreme values. Another measure that is sometimes used when extreme
values are present is the trimmed mean. The trimmed mean is obtained by deleting a
percentage of the smallest and largest values from a data set and then computing the mean
of the remaining values. For example, the 5% trimmed mean is obtained by removing the
smallest 5% and the largest 5% of the data values and then computing the mean of the
remaining values. Using the sample with n = 12 starting salaries, 0.05(12) = 0.6. Rounding
this value to 1 indicates that the 5% trimmed mean is obtained by removing the smallest
data value and the largest data value and then computing the mean of the remaining 10
values. For the starting salary data, the 5% trimmed mean is 5924.50.

2. Other commonly used percentiles are the quintiles (the 20th, 40th, 60th, and 80th
percentiles) and the deciles (the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and
90th percentiles).


EXERCISES


Methods


1. Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the mean and
median.

2. Consider a sample with data values of 10, 20, 21, 17, 16, and 12. Compute the mean
and median.


3. Consider the following data and corresponding weights.

Weight (w_i)
6
3
2
8

a. Compute the weighted mean.
b. Compute the sample mean of the four data values without weighting. Note the
difference in the results provided by the two computations.
4. Consider the following data.


Period    Rate of Return (%)
1         -6.0
2         -8.0
3         -4.0
4          2.0
5          5.4


What is the mean growth rate over these five periods? 

5. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute the 
20th, 25th, 65th, and 75th percentiles. 

6. Consider a sample with data values of 53, 55, 70, 58, 64, 57, 53, 69, 57, 68, and 53. 
Compute the mean, median, and mode. 


Applications 
7. eICU Waiting Times. There is a severe shortage of critical care doctors and nurses to 

provide intensive-care services in hospitals. To offset this shortage, many hospitals, such as 
Emory Hospital in Atlanta, are using electronic intensive-care units (eICUs) to help provide 
this care to patients (Emory University News Center). eICUs use electronic monitoring tools 
and two-way communication through video and audio so that a centralized staff of specially 
trained doctors and nurses—who can be located as far away as Australia—can provide 
critical care services to patients located in remote hospitals without fully staffed ICUs. One 
of the most important metrics tracked by these eICUs is the time that a patient must wait for 
the first video interaction between the patient and the eICU staff. Consider the following 
sample of 40 patient waiting times until their first video interaction with the eICU staff. 


DATA file: eICU

Wait Time (minutes)
40    46    49    44
45    45    38    51
42    46    41    45
49    41    48    42
49    40    42    43
43    42    41    41
55    43    42    40
42    40    49    43
44    45    61    S
40    37    39    43


a. Compute the mean waiting time for these 40 patients. 
b. Compute the median waiting time.
c. Compute the mode. 
d. Compute the first and third quartiles. 

8. Middle-Level Manager Salaries. Suppose that an independent study of middle- 
level managers employed at companies located in Atlanta, Georgia, was conducted to 
compare the salaries of managers working at firms in Atlanta to the salaries of middle- 
level managers across the nation. The following data show the salary, in thousands of 
dollars, for a sample of 15 middle-level managers. 


108 83 106 73 53 85 80 63 67 75 124 55 93 118 77 


a. Compute the median salary for the sample of 15 middle-level managers. Suppose 
the median salary of middle-level managers employed at companies located across 
the nation is $85,000. How does the median for middle-level managers in the
Atlanta area compare to the median for managers across the nation? 
b. Compute the mean annual salary for managers in the Atlanta area and discuss how 
and why it differs from the median computed in part (a). 
c. Compute the first and third quartiles for salaries of middle-level managers in the 
Atlanta area. 
9. Advertising Spending. Which companies spend the most money on advertising?
Business Insider maintains a list of the top-spending companies. In 2014, Procter &
Gamble spent more than any other company, a whopping $5 billion. In second place
was Comcast, which spent $3.08 billion (Business Insider website). The top 12 companies
and the amount each spent on advertising in billions of dollars are as follows:

DATA file: AdvertisingSpend

                    Advertising                             Advertising
Company             ($ billions)    Company                 ($ billions)
Procter & Gamble    $5.00           American Express        $2.19
Comcast              3.08           General Motors           2.15
AT&T                 2.91           Toyota                   2.09
Ford                 2.56           Fiat Chrysler            1
Verizon              2.44           Walt Disney Company      1.96
L'Oreal              2.34           J.P. Morgan              1.88


a. What is the mean amount spent on advertising? 

b. What is the median amount spent on advertising? 

c. What are the first and third quartiles? 
10. Hardshell Jacket Ratings. OutdoorGearLab is an organization that tests outdoor gear
used for climbing, camping, mountaineering, and backpacking. Suppose that the
following data show the ratings of hardshell jackets based on the breathability, durability,
versatility, features, mobility, and weight of each jacket. The ratings range from 0
(lowest) to 100 (highest).

DATA file: JacketRatings

42 66 67 71 78 62 61 76 71 67
61 64 61 54 83 63 68 69 81 53


a. Compute the mean, median, and mode. 

b. Compute the first and third quartiles. 

c. Compute and interpret the 90th percentile. 

11. Time Spent Watching Traditional TV. Nielsen tracks the amount of time that people
spend consuming media content across different platforms (digital, audio, television) in the
United States. Nielsen has found that traditional television viewing habits vary based on the
age of the consumer as an increasing number of people consume media through streaming
devices (Nielsen website). The following data represent the weekly traditional TV viewing
hours in 2016 for a sample of 14 people aged 18-34 and 12 people aged 35-49.

DATA file: TelevisionViewing

Viewers aged 18-34: 24.2, 21.0, 17.8, 19.6, 23.4, 19.1, 14.6, 27.1, 19.2, 18.3,
22.9, 23.4, 17.3, 20.5
Viewers aged 35-49: 24.9, 34.9, 35.8, 31.9, 35.4, 29.9, 30.9, 36.7, 36.2, 33.8,
29.5, 30.8


a. Compute the mean and median weekly hours of traditional TV viewed by those 
aged 18-34. 

b. Compute the mean and median weekly hours of traditional TV viewed by those
aged 35-49.

c. Compare the mean and median viewing hours for each age group. Which group 
watches more traditional TV per week? 
12. Online Multiplayer Game Downloads. The creator of a new online multiplayer
survival game has been tracking the monthly downloads of the game. The following
table shows the monthly downloads (in thousands) for each month of the current and
previous year.


Month              Downloads      Month             Downloads
(previous year)    (thousands)    (current year)    (thousands)
February           33.0           January           37.0
March              34.0           February          37.0
April              34.0           March             37.0
May                32.0           April             38.0
June               32.0           May               37.0
July               35.0           June              36.0
August             34.0           July              37.0
September          8720           August            35.0
October            37.0           September         33.0
November           35.0           October           32.0
December           33.0

a. Compute the mean, median, and mode for number of downloads in the 
previous year. 

b. Compute the mean, median, and mode for number of downloads in the 
current year. 

c. Compute the first and third quartiles for downloads in the previous year. 

d. Compute the first and third quartiles for downloads in the current year. 

e. Compare the values calculated in parts (a) through (d) for the previous and current 
years. What does this tell you about the downloads of the game in the current year 
compared to the previous year? 

13. Automobile Fuel Efficiencies. In automobile mileage and gasoline-consumption 
testing, 13 automobiles were road tested for 300 miles in both city and highway 
driving conditions. The following data were recorded for miles-per-gallon 
performance. 

City: 16.2 16.7 15.9 14.4 13.2 15.3 16.8 16.0 16.1 15.3 15.2 15.3 16.2 

Highway: 19.4 20.6 18.3 18.6 19.2 17.4 17.2 18.6 19.0 21.1 19.4 18.5 18.7 

Use the mean, median, and mode to make a statement about the difference in
performance for city and highway driving.

14. Unemployment Rates by State. The U.S. Bureau of Labor Statistics collects data on
unemployment rates in each state. The data contained in the file UnemploymentRates
show the unemployment rate for every state over two consecutive years. To compare
unemployment rates for the previous year with unemployment rates for the current
year, compute the first quartile, the median, and the third quartile for the previous year
unemployment data and the current year unemployment data. What do these statistics
suggest about the change in unemployment rates across the states over these
two years?

DATA file: UnemploymentRates

15. Motor Oil Prices. Martinez Auto Supplies has retail stores located in eight cities in
California. The price they charge for a particular product in each city varies because of
differing competitive conditions. For instance, the price they charge for a case of a popular
brand of motor oil in each city follows. Also shown are the number of cases that Martinez
Auto sold last quarter in each city.

City             Price ($)    Sales (cases)
Bakersfield      34.99         501
Los Angeles      38.99        1425
Modesto          36.00         294
Oakland          33.59         882
Sacramento       40.99         715
San Diego        38.59        1088
San Francisco    39.59        1644
San Jose         S)            819


Compute the average sales price per case for this product during the last quarter. 

16. Calculating Grade Point Averages. The grade point average for college students is 
based on a weighted mean computation. For most colleges, the grades are given the 
following data values: A (4), B (3), C (2), D (1), and F (0). After 60 credit hours of 
course work, a student at State University earned 9 credit hours of A, 15 credit hours 
of B, 33 credit hours of C, and 3 credit hours of D. 

a. Compute the student’s grade point average. 

b. Students at State University must maintain a 2.5 grade point average for their first 
60 credit hours of course work in order to be admitted to the business college. Will 
this student be admitted? 

17. Mutual Fund Rate of Return. The following table shows the total return and the 
number of funds for four categories of mutual funds. 


Type of Fund Number of Funds Total Return (%) 
Domestic Equity 9191 4.65 
International Equity 2621 18.15 
Specialty Stock 1419 11.36 
Hybrid 2900 6.75 


a. Using the number of funds as weights, compute the weighted average total return 
for these mutual funds. 

b. Is there any difficulty associated with using the “number of funds” as the weights in 
computing the weighted average total return in part (a)? Discuss. What else might 
be used for weights? 

c. Suppose you invested $10,000 in this group of mutual funds and diversified the 
investment by placing $2,000 in Domestic Equity funds, $4,000 in International 
Equity funds, $3,000 in Specialty Stock funds, and $1,000 in Hybrid funds. What is 
the expected return on the portfolio? 

18. Business School Ranking. Based on a survey of master’s programs in business 
administration, magazines such as U.S. News & World Report rank U.S. business 
schools. These types of rankings are based in part on surveys of business school 
deans and corporate recruiters. Each survey respondent is asked to rate the overall
academic quality of the master’s program on a scale from 1 “marginal” to 5
“outstanding.” Use the sample of responses shown below to compute the weighted 
mean score for the business school deans and the corporate recruiters. Discuss. 


Quality Assessment Business School Deans Corporate Recruiters 
5 44 ei 
4 66 34 
3 60 43 
2 10 12 
1 0 0

19. Revenue Growth Rate. Annual revenue for Corning Supplies grew by 5.5% in 2014;
1.1% in 2015; -3.5% in 2016; -1.1% in 2017; and 1.8% in 2018. What is the mean
annual growth rate over this period?


20. Mutual Fund Comparison. Suppose that at the beginning of Year 1 you invested
$10,000 in the Stivers mutual fund and $5,000 in the Trippi mutual fund. The value of 
each investment at the end of each subsequent year is provided in the table below. Which 
mutual fund performed better? 


Year Stivers Trippi 
1 11,000 5,600
2 12,000 6,300 
3 13,000 6,900 
4 14,000 7,600 
5 15,000 8,500 
6 16,000 9,200 
7 17,000 9,900 
8 18,000 10,600 


21. Asset Growth Rate. If an asset declines in value from $5000 to $3500 over nine 
years, what is the mean annual growth rate in the asset’s value over these nine years? 
22. Company Value Growth Rate. The current value of a company is $25 million. 
If the value of the company six years ago was $10 million, what is the company’s 
mean annual growth rate over the past six years? 


3.2 Measures of Variability 


In addition to measures of location, it is often desirable to consider measures of variability,
or dispersion. For example, suppose that you are a purchasing agent for a large
manufacturing firm and that you regularly place orders with two different suppliers.
After several months of operation, you find that the mean number of days required to
fill orders is 10 days for both of the suppliers. The histograms summarizing the number
of working days required to fill orders from the suppliers are shown in Figure 3.5.
Although the mean number of days is 10 for both suppliers, do the two suppliers
demonstrate the same degree of reliability in terms of making deliveries on schedule?
Note the dispersion, or variability, in delivery times indicated by the histograms.
Which supplier would you prefer?

The variability in the delivery time creates uncertainty for production scheduling.
Methods in this section help measure and understand variability.

FIGURE 3.5 Historical Data Showing the Number of Days Required to Fill Orders

For most firms, receiving materials and supplies on schedule is important. The 7- or 
8-day deliveries shown for J.C. Clark Distributors might be viewed favorably; however, a 
few of the slow 13- to 15-day deliveries could be disastrous in terms of keeping a work- 
force busy and production on schedule. This example illustrates a situation in which the 
variability in the delivery times may be an overriding consideration in selecting a supplier. 
For most purchasing agents, the lower variability shown for Dawson Supply, Inc. would 
make Dawson the preferred supplier. 

We turn now to a discussion of some commonly used measures of variability. 


Range 


The simplest measure of variability is the range. 


RANGE 


Range = Largest value - Smallest value


Let us refer to the data on starting salaries for business school graduates in Table 3.1. The
largest starting salary is 6325 and the smallest is 5710. The range is 6325 - 5710 = 615.
Although the range is the easiest of the measures of variability to compute, it is seldom
used as the only measure. The reason is that the range is based on only two of the
observations and thus is highly influenced by extreme values. Suppose the highest paid
graduate received a starting salary of $15,000 per month. In this case, the range would
be 15,000 - 5710 = 9290 rather than 615. This large value for the range would not be
especially descriptive of the variability in the data because 11 of the 12 starting salaries are
closely grouped between 5710 and 6130.


Interquartile Range 


A measure of variability that overcomes the dependency on extreme values is the interquartile
range (IQR). This measure of variability is the difference between the third
quartile, \(Q_3\), and the first quartile, \(Q_1\). In other words, the interquartile range is the range
for the middle 50% of the data.


INTERQUARTILE RANGE

\[ \text{IQR} = Q_3 - Q_1 \tag{3.6} \]


For the data on monthly starting salaries, the quartiles are \(Q_3 = 6025\) and \(Q_1 = 5857.5\).
Thus, the interquartile range is 6025 - 5857.5 = 167.5.
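
The range and interquartile range for the starting salary data can be cross-checked with a short Python sketch. This is an illustration outside the Excel workflow used in this text. It assumes Python's standard `statistics` module, whose `quantiles` function with `method="exclusive"` places each quartile at position (n + 1)p and, for these data, gives the same quartiles quoted above. The variable names are illustrative only.

```python
import statistics

# Monthly starting salaries ($) for the 12 graduates in Table 3.1
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

# Range = largest value - smallest value
data_range = max(salaries) - min(salaries)

# Quartiles located at position (n + 1)p; quantiles() sorts the data itself
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="exclusive")
iqr = q3 - q1

print(data_range)  # 615
print(q1, q3)      # 5857.5 6025.0
print(iqr)         # 167.5
```

Other software may use a different quartile convention, so quartiles and the IQR can differ slightly from package to package on the same data.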


Variance 


The variance is a measure of variability that utilizes all the data. The variance is based
on the difference between the value of each observation (\(x_i\)) and the mean. The difference
between each \(x_i\) and the mean (\(\bar{x}\) for a sample, \(\mu\) for a population) is called a deviation
about the mean. For a sample, a deviation about the mean is written \((x_i - \bar{x})\); for a population,
it is written \((x_i - \mu)\). In the computation of the variance, the deviations about the
mean are squared.

If the data are for a population, the average of the squared deviations is called the population
variance. The population variance is denoted by the square of the Greek lowercase
letter sigma, or \(\sigma^2\). For a population of N observations and with \(\mu\) denoting the population
mean, the definition of the population variance is as follows.

POPULATION VARIANCE

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \tag{3.7} \]


In most statistical applications, the data being analyzed are for a sample. When we
compute a sample variance, we are often interested in using it to estimate the population
variance \(\sigma^2\). Although a detailed explanation is beyond the scope of this text, it can be
shown that if the sum of the squared deviations about the sample mean is divided by n - 1,
and not n, the resulting sample variance provides an unbiased estimate of the population
variance. For this reason, the sample variance, denoted by \(s^2\), is defined as follows.


The sample variance \(s^2\) is a point estimator of the population variance \(\sigma^2\).

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \tag{3.8} \]


To illustrate the computation of the sample variance, we will use the data on class size
for the sample of five college classes as presented in Section 3.1. A summary of the data,
including the computation of the deviations about the mean and the squared deviations
about the mean, is shown in Table 3.3. The sum of squared deviations about the mean is
\(\sum (x_i - \bar{x})^2 = 256\). Hence, with n - 1 = 4, the sample variance is

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{4} = 64 \]


Before moving on, let us note that the units associated with the sample variance
often cause confusion. Because the values being summed in the variance calculation,
\((x_i - \bar{x})^2\), are squared, the units associated with the sample variance are also squared.
For instance, the sample variance for the class size data is \(s^2 = 64\) (students)\(^2\). The
squared units associated with variance make it difficult to develop an intuitive understanding
and interpretation of the numerical value of the variance. We recommend that
you think of the variance as a measure useful in comparing the amount of variability
for two or more variables. In a comparison of the variables, the one with the largest
variance shows the most variability. Further interpretation of the value of the variance
may not be necessary.


The variance is useful in 
comparing the variability of 
two or more variables. 


TABLE 3.3 Computation of Deviations and Squared Deviations About the
Mean for the Class Size Data

Number of Students     Mean Class          Deviation About                  Squared Deviation About
in Class (\(x_i\))     Size (\(\bar{x}\))  the Mean (\(x_i - \bar{x}\))     the Mean (\((x_i - \bar{x})^2\))
46                     44                    2                                4
54                     44                   10                              100
42                     44                   -2                                4
46                     44                    2                                4
32                     44                  -12                              144
                                             0                              256
                                           \(\sum (x_i - \bar{x})\)         \(\sum (x_i - \bar{x})^2\)

TABLE 3.4 Computation of the Sample Variance for the Starting
Salary Data

Monthly                Sample              Deviation About                  Squared Deviation About
Salary (\(x_i\))       Mean (\(\bar{x}\))  the Mean (\(x_i - \bar{x}\))     the Mean (\((x_i - \bar{x})^2\))
5850                   5940                 -90                               8100
5950                   5940                  10                                100
6050                   5940                 110                             12,100
5880                   5940                 -60                               3600
5755                   5940                -185                             34,225
5710                   5940                -230                             52,900
5890                   5940                 -50                               2500
6130                   5940                 190                             36,100
5940                   5940                   0                                  0
6325                   5940                 385                            148,225
5920                   5940                 -20                                400
5880                   5940                 -60                               3600
                                              0                            301,850
                                           \(\sum (x_i - \bar{x})\)         \(\sum (x_i - \bar{x})^2\)


Using equation (3.8),

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{301{,}850}{11} = 27{,}440.91 \]


As another illustration of computing a sample variance, consider the starting salaries
listed in Table 3.1 for the 12 business school graduates. In Section 3.1, we showed
that the sample mean starting salary was 5940. The computation of the sample variance
(\(s^2 = 27{,}440.91\)) is shown in Table 3.4.

In Tables 3.3 and 3.4 we show both the sum of the deviations about the mean and the
sum of the squared deviations about the mean. For any data set, the sum of the deviations
about the mean will always equal zero. Note that in Tables 3.3 and 3.4, \(\sum (x_i - \bar{x}) = 0\).
The positive deviations and negative deviations cancel each other, causing the sum of the
deviations about the mean to equal zero.
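
The computations in Tables 3.3 and 3.4 can be retraced step by step with a short Python sketch. It is offered only as an illustration of equation (3.8), not as part of the Excel procedures in this text. The function name `sample_variance_steps` is hypothetical and the code uses no external libraries.

```python
def sample_variance_steps(values):
    """Return (sum of deviations, sum of squared deviations, sample variance).

    Follows equation (3.8): squared deviations about the sample mean
    are summed and divided by n - 1.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    squared_deviations = [d ** 2 for d in deviations]
    variance = sum(squared_deviations) / (n - 1)
    return sum(deviations), sum(squared_deviations), variance


# Class size data from Section 3.1 (Table 3.3)
class_sizes = [46, 54, 42, 46, 32]
dev_sum, sq_sum, variance = sample_variance_steps(class_sizes)
print(dev_sum, sq_sum, variance)  # 0.0 256.0 64.0

# Starting salary data from Table 3.1 (Table 3.4)
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]
sal_dev_sum, sal_sq_sum, sal_variance = sample_variance_steps(salaries)
print(sal_dev_sum, sal_sq_sum, round(sal_variance, 2))  # 0.0 301850.0 27440.91
```

Both data sets have integer means, so the deviations sum to exactly zero here. With non-integer means, floating-point rounding can leave a sum that is very close to zero rather than exactly zero.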


Standard Deviation 


The standard deviation is defined to be the positive square root of the variance. Following
the notation we adopted for a sample variance and a population variance, we use s to denote
the sample standard deviation and \(\sigma\) to denote the population standard deviation. The
standard deviation is derived from the variance in the following way.


The sample standard deviation s is a point estimator of the population standard deviation \(\sigma\).

STANDARD DEVIATION

\[ \text{Sample standard deviation} = s = \sqrt{s^2} \tag{3.9} \]

\[ \text{Population standard deviation} = \sigma = \sqrt{\sigma^2} \tag{3.10} \]


Recall that the sample variance for the sample of class sizes in five college classes is
\(s^2 = 64\). Thus, the sample standard deviation is \(s = \sqrt{64} = 8\). For the data on starting
salaries, the sample standard deviation is \(s = \sqrt{27{,}440.91} = 165.65\).

What is gained by converting the variance to its corresponding standard deviation?
Recall that the units associated with the variance are squared. For example, the sample
variance for the starting salary data of business school graduates is \(s^2 = 27{,}440.91\) (dollars)\(^2\).
Because the standard deviation is the square root of the variance, the units of the
variance, dollars squared, are converted to dollars in the standard deviation. Thus, the
standard deviation of the starting salary data is $165.65. In other words, the standard
deviation is measured in the same units as the original data. For this reason the standard
deviation is more easily compared to the mean and other statistics that are measured in the
same units as the original data.

The standard deviation is easier to interpret than the variance because
the standard deviation is measured in the same units as the data.


Using Excel to Compute the Sample Variance and Sample Standard Deviation


Excel provides functions for computing the sample variance and sample standard de- 
viation. We illustrate the use of these functions by computing the sample variance and 
sample standard deviation for the starting salary data in Table 3.1. Refer to Figure 3.6 as 
we describe the tasks involved. Figure 3.6 is an extension of Figure 3.2, where we showed 
how to use Excel functions to compute the mean, median, and mode. The formula work- 
sheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file StartingSalaries. The data are in cells B2:B13 and labels 
appear in column A and cell B1. 


Enter Functions and Formulas: Excel’s AVERAGE, MEDIAN, and MODE.SNGL func- 
tions were entered in cells E2:E4 as described earlier. Excel’s VAR.S function can be used 
to compute the sample variance by entering the following formula in cell E5: 


=VAR.S(B2:B13) 




Similarly, the formula =STDEV.S(B2:B13) is entered in cell E6 to compute the sample 
standard deviation. 


The labels in cells D2:D6 identify the output. Note that the sample variance (27440.91) and the 
sample standard deviation (165.65) are the same as we computed earlier using the definitions. 
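
Readers who want to check these Excel results outside a spreadsheet could use a sketch along the following lines. It assumes Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 and so play the same role as VAR.S and STDEV.S. The list below simply re-enters the values from cells B2:B13.

```python
import statistics

# Values from cells B2:B13 of the StartingSalaries worksheet
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

sample_var = statistics.variance(salaries)  # counterpart of =VAR.S(B2:B13)
sample_sd = statistics.stdev(salaries)      # counterpart of =STDEV.S(B2:B13)

print(round(sample_var, 2))  # 27440.91
print(round(sample_sd, 2))   # 165.65
```

The module also offers `pvariance` and `pstdev`, which divide by N and correspond to the population definitions in equations (3.7) and (3.10).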


FIGURE 3.6 Excel Worksheet Used to Compute the Sample Variance and the
Sample Standard Deviation for the Starting Salary Data

The coefficient of variation is a relative measure of variability; it measures the
standard deviation relative to the mean.




Chapter 3 Descriptive Statistics: Numerical Measures 


Coefficient of Variation 


In some situations we may be interested in a descriptive statistic that indicates how large 
the standard deviation is relative to the mean. This measure is called the coefficient of 
variation and is usually expressed as a percentage. 


COEFFICIENT OF VARIATION

\[ \left( \frac{\text{Standard deviation}}{\text{Mean}} \times 100 \right)\% \tag{3.11} \]


For the class size data, we found a sample mean of 44 and a sample standard deviation
of 8. The coefficient of variation is [(8/44) × 100]% = 18.2%. In words, the coefficient
of variation tells us that the sample standard deviation is 18.2% of the value of the sample
mean. For the starting salary data with a sample mean of 5940 and a sample standard
deviation of 165.65, the coefficient of variation, [(165.65/5940) × 100]% = 2.8%, tells us
the sample standard deviation is only 2.8% of the value of the sample mean. In general, the
coefficient of variation is a useful statistic for comparing the variability of variables that
have different standard deviations and different means.
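
Equation (3.11) is simple enough to sketch in a few lines of Python. The function name `coefficient_of_variation` is hypothetical and the code relies only on the standard `statistics` module. For the starting salary data it works from the sample mean of 5940 and the sample standard deviation of 165.65 shown in Table 3.4 and Figure 3.8.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


class_sizes = [46, 54, 42, 46, 32]
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

cv_class = coefficient_of_variation(class_sizes)
cv_salary = coefficient_of_variation(salaries)

print(round(cv_class, 1))   # 18.2
print(round(cv_salary, 1))  # 2.8
```

Because the result is a pure percentage, the two values can be compared directly even though one data set is measured in students and the other in dollars.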


Using Excel's Descriptive Statistics Tool 


As we have seen, Excel provides statistical functions to compute descriptive statistics 
for a data set. These functions can be used to compute one statistic at a time (e.g., mean, 
variance). Excel also provides a variety of data analysis tools. One of these, called De- 
scriptive Statistics, allows the user to compute a variety of descriptive statistics at once. 
We will now show how Excel’s Descriptive Statistics tool can be used for the starting 
salary data in Table 3.1. Refer to Figure 3.7 as we describe the tasks involved. 


Enter/Access Data: Open the file StartingSalaries. The data are in cells B2:B13 and labels 
appear in column A and in cell B1. 


Apply Tools: The following steps describe how to use Excel’s Descriptive Statistics tool 
for these data. 


Step 1. Click the Data tab on the Ribbon 
Step 2. In the Analysis group, click Data Analysis 


FIGURE 3.7 Dialog Box for Excel's Descriptive Statistics Tool

FIGURE 3.8 Descriptive Statistics Provided by Excel for the Starting Salary Data

      A           B                    D                              E
 1    Graduate    Monthly Starting     Monthly Starting Salary ($)
                  Salary ($)
 2    1           5850
 3    2           5950                 Mean                           5940
 4    3           6050                 Standard Error                 47.8199
 5    4           5880                 Median                         5905
 6    5           5755                 Mode                           5880
 7    6           5710                 Standard Deviation             165.65
 8    7           5890                 Sample Variance                27440.91
 9    8           6130                 Kurtosis                       1.72
10    9           5940                 Skewness                       1.09
11    10          6325                 Range                          615
12    11          5920                 Minimum                        5710
13    12          5880                 Maximum                        6325
14                                     Sum                            71280
15                                     Count                          12


Step 3. Choose Descriptive Statistics from the list of Analysis Tools 
Step 4. When the Descriptive Statistics dialog box appears (see Figure 3.7): 
Enter B1:B13 in the Input Range: box 
Select Columns in the Grouped By: area 
Select the check box for Labels in First Row 
Select Output Range: in the Output options area 
Enter D1 in the Output Range: box (to identify the upper left corner of the
section of the worksheet where the descriptive statistics will appear)
Select the check box for Summary Statistics 
Click OK 


Cells D1:E15 of Figure 3.8 show the descriptive statistics provided by Excel. The
boldfaced entries are the descriptive statistics that we have already covered. The descriptive 
statistics that are not boldfaced are either covered subsequently in the text or discussed in 
more advanced texts. 
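
For comparison with the output of the Descriptive Statistics tool, the sketch below assembles most of the same summary in Python. It is an illustration under stated assumptions: it uses only the standard `math` and `statistics` modules, computes the standard error as the sample standard deviation divided by the square root of n, and leaves out kurtosis and skewness. The dictionary labels are chosen here to echo the labels in Figure 3.8.

```python
import math
import statistics

# Starting salary data from Table 3.1
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

n = len(salaries)
sd = statistics.stdev(salaries)

summary = {
    "Mean": statistics.mean(salaries),
    "Standard Error": sd / math.sqrt(n),  # s divided by the square root of n
    "Median": statistics.median(salaries),
    "Mode": statistics.mode(salaries),
    "Standard Deviation": sd,
    "Sample Variance": statistics.variance(salaries),
    "Range": max(salaries) - min(salaries),
    "Minimum": min(salaries),
    "Maximum": max(salaries),
    "Sum": sum(salaries),
    "Count": n,
}

# Print one statistic per line, label on the left and value on the right
for label, value in summary.items():
    print(f"{label:<20}{value:>14.4f}")
```

Each value can be compared line by line with the corresponding entry in Figure 3.8.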


NOTES + COMMENTS

1. Statistical software packages and spreadsheets can be used to develop the descriptive
statistics presented in this chapter. After the data are entered into a worksheet,
a few simple commands can be used to generate the desired output.

2. Standard deviation is a commonly used measure of the risk associated with investing
in stock and stock funds. It provides a measure of how monthly returns fluctuate around
the long-run average return.

3. Rounding the value of the sample mean \(\bar{x}\) and the values of the squared
deviations \((x_i - \bar{x})^2\) may introduce errors when calculating the variance and
standard deviation. To reduce rounding errors, we recommend carrying at least
six significant digits during intermediate calculations. The resulting variance or
standard deviation can then be rounded to fewer digits.

4. An alternative formula for the computation of the sample variance is

\[ s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} \]

where \(\sum x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2\).

5. The mean absolute error (MAE) is another measure of variability that is computed
by summing the absolute values of the deviations of the observations about the mean
and dividing this sum by the number of observations. For a sample of size n, the MAE
is computed as follows:

\[ \text{MAE} = \frac{\sum |x_i - \bar{x}|}{n} \]

For the class size data presented in Section 3.1, \(\bar{x} = 44\),
\(\sum |x_i - \bar{x}| = 28\), and the MAE = 28/5 = 5.6. You can learn
more about the MAE and other measures of variability in
Chapter 17.
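
Notes 4 and 5 can be checked numerically with the class size data. The following Python sketch is illustrative only. It evaluates the alternative variance formula, which should agree with the value of 64 found from equation (3.8), and then the mean absolute error. Variable names are chosen here for readability.

```python
# Class size data from Section 3.1
class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n  # 44.0

# Note 4: s^2 = (sum of x_i^2 - n * mean^2) / (n - 1)
sum_of_squares = sum(x ** 2 for x in class_sizes)
alt_variance = (sum_of_squares - n * mean ** 2) / (n - 1)

# Note 5: MAE = sum of |x_i - mean| / n
abs_dev_sum = sum(abs(x - mean) for x in class_sizes)
mae = abs_dev_sum / n

print(sum_of_squares)  # 9936
print(alt_variance)    # 64.0
print(abs_dev_sum)     # 28.0
print(mae)             # 5.6
```

The alternative formula subtracts two large, nearly equal quantities, so with a rounded mean it is more exposed to the rounding errors described in Note 3 than the definitional formula.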
EXERCISES


Methods 

23. Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the range and 
interquartile range. 

24. Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the variance 
and standard deviation. 

25. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute the 
range, interquartile range, variance, and standard deviation. 


Applications 

26. Price of Unleaded Gasoline. Data collected by the Oil Price Information Service from 
more than 90,000 gasoline and convenience stores throughout the United States showed 
that the average price for a gallon of unleaded gasoline was $3.28 (MSN Auto website). 
The following data show the price per gallon ($) for a sample of 20 gasoline and conveni- 
ence stores located in San Francisco. 


3.59 3.59 4.79 3.56 3.55 3.71 3.65 3.60 3.75 3.56
3.57 3.59 3.55 3.99 4.15 3.66 3.63 3.73 3.61 3.57

DATA file SFGasPrices
a. Use the sample data to estimate the mean price for a gallon of unleaded gasoline in 
San Francisco. 
b. Compute the sample standard deviation. 
c. Compare the mean price per gallon for the sample data to the national average 
price. What conclusions can you draw about the cost of living in San Francisco? 
27. Round-Trip Flight Prices. The following table displays round-trip flight prices from 
14 major U.S. cities to Atlanta and Salt Lake City. 
DATA file Flights

                    Round-Trip Cost ($)
Departure City      Atlanta    Salt Lake City
Cincinnati          340.10     570.10
New York            321.60     354.60
Chicago             291.60     465.60
Denver              339.60     219.60
Los Angeles         359.60     311.60
Seattle             384.60     297.60
Detroit             309.60     471.60
Philadelphia        415.60     618.40
Washington, DC      293.60     513.60
Miami               249.60     523.20
San Francisco       539.60     381.60
Las Vegas           455.60     159.60
Phoenix             359.60     267.60
Dallas              333.90     458.60


a. Compute the mean price for a round-trip flight into Atlanta and the mean price for 
a round-trip flight into Salt Lake City. Is Atlanta less expensive to fly into than Salt 
Lake City? If so, what could explain this difference? 

b. Compute the range, variance, and standard deviation for the two samples. What 
does this information tell you about the prices for flights into these two cities? 

28. Annual Sales Amounts. Varatta Enterprises sells industrial plumbing valves. The following
table lists the annual sales amounts for the different salespeople in the organization
for the most recent fiscal year.

Salesperson Sales Amount ($1000) Salesperson Sales Amount ($1000)
Joseph 147 Wei 465
ie 232 Samantha 410
Phillip 547 Erin 298
Stanley 328 Dominic 321
Luke 295 Charlie 190
Lexie 194 Amol 21
Margaret 368 Lenisa 413


a. Compute the mean, variance, and standard deviation for these annual sales values. 

b. In the previous fiscal year, the average annual sales amount was $300,000 with a 
standard deviation of $95,000. Discuss any differences you observe between the 
annual sales amount in the most recent and previous fiscal years. 

29. Air Quality Index. The Los Angeles Times regularly reports the air quality index for 
various areas of Southern California. A sample of air quality index values for Pomona 
provided the following data: 28, 42, 58, 48, 45, 55, 60, 49, and 50. 

a. Compute the range and interquartile range. 

b. Compute the sample variance and sample standard deviation. 

c. A sample of air quality index readings for Anaheim provided a sample mean of 
48.5, a sample variance of 136, and a sample standard deviation of 11.66. What 
comparisons can you make between the air quality in Pomona and that in Anaheim 
on the basis of these descriptive statistics? 

30. Reliability of Delivery Service. The following data were used to construct the histo- 
grams of the number of days required to fill orders for Dawson Supply, Inc. and J.C. 
Clark Distributors (see Figure 3.5). 


Dawson Supply Days for Delivery: 11 10 9 10 11 11 10 11 10 10 
Clark Distributors Days for Delivery: 8 10 13 7 10 11 10 7 15 12 


Use the range and standard deviation to support the previous observation that Dawson 
Supply provides the more consistent and reliable delivery times. 

31. Cellular Phone Spending. According to the 2016 Consumer Expenditure Survey, 
Americans spend an average of $1124 on cellular phone service annually (U.S. Bureau 
of Labor Statistics website). Suppose that we wish to determine if there are differences 
in cellular phone expenditures by age group. Therefore, samples of 10 consumers 
were selected for three age groups (18-34, 35—44, 45 and older). The annual expendi- 
ture for each person in the sample is provided in the table below. 


18-34 35-44 45 and Older 
1355 969 1135 
oe Ws) 434 956 
1456 1792 400
2045 1500 1374
1621 1277 1244 
994 1056 825 
1937 1922 763 
1200 1350 1192 
1390 1415 1510 


a. Compute the mean, variance, and standard deviation for each of these three samples. 
b. What observations can be made based on these data?

32. Advertising Spend by Companies. Advertising Age annually compiles a list of the
companies that spend the most on advertising. Consumer-goods company Procter
& Gamble has often topped the list, spending billions of dollars annually. Consider the
data found in the file Advertising. It contains annual advertising expenditures for a
sample of 20 companies in the automotive sector and 20 companies in the department
store sector.
a. What is the mean advertising spend for each sector?
b. What is the standard deviation for each sector?
c. What is the range of advertising spend for each sector?
d. What is the interquartile range for each sector?
e. Based on this sample and your answers to parts (a) to (d), comment on any differences
in the advertising spending in the automotive companies versus the department store
companies.
33. Amateur Golfer Scores. Scores turned in by an amateur golfer at the Bonita Fairways 
Golf Course in Bonita Springs, Florida, during 2018 and 2019 are as follows: 


2018 Season: 74 78 79 77 75 73 75 77
2019 Season: 71 70 75 77 85 80 71 79 




a. Use the mean and standard deviation to evaluate the golfer’s performance over the 
two-year period. 
b. What is the primary difference in performance between 2018 and 2019? What 
improvement, if any, can be seen in the 2019 scores? 
34. Consistency of Running Times. The following times were recorded by the quarter- 
mile and mile runners of a university track team (times are in minutes). 


Quarter-Mile Times: .92 .98 1.04 .90 .99
Mile Times: 4.52 4.35 4.60 4.70 4.50 


After viewing this sample of running times, one of the coaches commented that the 
quarter-milers turned in the more consistent times. Use the standard deviation and the 
coefficient of variation to summarize the variability in the data. Does the use of the 
coefficient of variation indicate that the coach’s statement should be qualified? 


3.3 Measures of Distribution Shape, Relative Location, 
and Detecting Outliers 


We have described several measures of location and variability for data. In addition, it is 
often important to have a measure of the shape of a distribution. In Chapter 2 we noted that 
a histogram provides a graphical display showing the shape of a distribution. An important 
numerical measure of the shape of a distribution is called skewness. 


Distribution Shape 


Figure 3.9 shows four histograms constructed from relative frequency distributions. The
histograms in Panels A and B are moderately skewed. The one in Panel A is skewed to the
left; its skewness is -.85. The histogram in Panel B is skewed to the right; its skewness is
+.85. The histogram in Panel C is symmetric; its skewness is zero. The histogram in Panel D
is highly skewed to the right; its skewness is 1.62. The formula used to compute skewness
is somewhat complex.¹ However, the skewness can easily be computed using statistical
software. For data skewed to the left, the skewness is negative; for data skewed to the right,
the skewness is positive. If the data are symmetric, the skewness is zero.

Excel uses the function SKEW to compute the skewness of data.


¹The formula for the skewness of sample data:

\[ \text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 \]

FIGURE 3.9 Histograms Showing the Skewness for Four Distributions

Panel A: Moderately Skewed Left (Skewness = -.85)
Panel B: Moderately Skewed Right (Skewness = .85)
Panel C: Symmetric (Skewness = 0)
Panel D: Highly Skewed Right (Skewness = 1.62)


For a symmetric distribution, the mean and the median are equal. When the data are 
positively skewed, the mean will usually be greater than the median; when the data are 
negatively skewed, the mean will usually be less than the median. The data used to con- 
struct the histogram in Panel D are customer purchases at a women’s apparel store. The 
mean purchase amount is $77.60 and the median purchase amount is $59.70. The rela- 
tively few large purchase amounts tend to increase the mean, whereas the median remains 
unaffected by the large purchase amounts. The median provides the preferred measure of 
location when the data are highly skewed. 
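
To make the skewness measure concrete, the sketch below computes it in Python for the starting salary data of Table 3.1. It assumes the adjusted sample skewness formula that Excel documents for its SKEW function, namely n/[(n - 1)(n - 2)] times the sum of the cubed standardized deviations. The function name `sample_skewness` is hypothetical.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

skew_salaries = sample_skewness(salaries)
skew_symmetric = sample_skewness([1, 2, 3, 4, 5])

print(round(skew_salaries, 2))  # 1.09
print(skew_symmetric)           # 0.0
```

The positive value for the salary data reflects the single high salary of 6325, which pulls the right tail of the distribution outward. The symmetric data set has a skewness of zero.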


z-Scores 


In addition to measures of location, variability, and shape, we are also interested in the 
relative location of values within a data set. Measures of relative location help us determine 
how far a particular value is from the mean. 

By using both the mean and standard deviation, we can determine the relative
location of any observation. Suppose we have a sample of n observations, with the
values denoted by \(x_1, x_2, \ldots, x_n\). In addition, assume that the sample mean, \(\bar{x}\), and the
sample standard deviation, s, are already computed. Associated with each value, \(x_i\), is
another value called its z-score. Equation (3.12) shows how the z-score is computed
for each \(x_i\).

z-SCORE

\[ z_i = \frac{x_i - \bar{x}}{s} \tag{3.12} \]

where
\(z_i\) = the z-score for \(x_i\)
\(\bar{x}\) = the sample mean
s = the sample standard deviation


The z-score is often called the standardized value. The z-score, \(z_i\), can be interpreted
as the number of standard deviations \(x_i\) is from the mean \(\bar{x}\). For example, \(z_1 = 1.2\) would
indicate that \(x_1\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -.5\)
would indicate that \(x_2\) is .5, or 1/2, standard deviation less than the sample mean. A z-score
greater than zero occurs for observations with a value greater than the mean, and a z-score
less than zero occurs for observations with a value less than the mean. A z-score of zero
indicates that the value of the observation is equal to the mean.

The z-score for any observation can be interpreted as a measure of the relative location 
of the observation in a data set. Thus, observations in two different data sets with the same 
z-score can be said to have the same relative location in terms of being the same number of 
standard deviations from their respective means. 

The z-scores for the class size data from Section 3.1 are computed in Table 3.5. Recall
the previously computed sample mean, \(\bar{x} = 44\), and sample standard deviation, s = 8. The
z-score of -1.50 for the fifth observation shows it is farthest from the mean; it is 1.50 standard
deviations below the mean. Figure 3.10 provides a dot plot of the class size data with
a graphical representation of the associated z-scores on the axis below.
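
The z-score computation in equation (3.12) can be sketched in Python for the class size data. This is a minimal illustration that uses the standard `statistics` module for the sample mean and sample standard deviation. The variable names are illustrative.

```python
import statistics

# Class size data from Section 3.1
class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)  # 44
s = statistics.stdev(class_sizes)    # 8.0

# Equation (3.12): z_i = (x_i - mean) / s
z_scores = [(x - mean) / s for x in class_sizes]
print(z_scores)  # [0.25, 1.25, -0.25, 0.25, -1.5]

# The observation farthest from the mean has the largest absolute z-score
farthest = max(class_sizes, key=lambda x: abs((x - mean) / s))
print(farthest)  # 32
```

The z-scores sum to zero because the deviations about the mean sum to zero.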


Chebyshev’s Theorem 


Chebyshev’s theorem enables us to make statements about the proportion of data values 
that must be within a specified number of standard deviations of the mean. 


CHEBYSHEV’S THEOREM 


At least \((1 - 1/z^2)\) of the data values must be within z standard deviations of the mean,
where z is any value greater than 1.


Some of the implications of this theorem, with z = 2, 3, and 4 standard deviations, follow. 


• At least .75, or 75%, of the data values must be within z = 2 standard deviations of
the mean.
• At least .89, or 89%, of the data values must be within z = 3 standard deviations of
the mean.
• At least .94, or 94%, of the data values must be within z = 4 standard deviations of
the mean.
TABLE 3.5  z-Scores for the Class Size Data

Number of Students    Deviation About the Mean    z-Score
in Class (x_i)        (x_i − x̄)                   (x_i − x̄)/s
46                      2                           2/8 =   .25
54                     10                          10/8 =  1.25
42                     −2                          −2/8 =  −.25
46                      2                           2/8 =   .25
32                    −12                         −12/8 = −1.50

FIGURE 3.10 Dot Plot Showing Class Size Data and z-Scores

Chebyshev’s theorem requires z > 1, but z need not be an integer.

The empirical rule is based on the normal probability distribution, which will be
discussed in Chapter 6. The normal distribution is used extensively throughout the text.


For an example using Chebyshev’s theorem, suppose that the midterm test scores for 
100 students in a college business statistics course had a mean of 70 and a standard devia- 
tion of 5. How many students had test scores between 60 and 80? How many students had 
test scores between 58 and 82? 

For the test scores between 60 and 80, we note that 60 is two standard deviations below the 
mean and 80 is two standard deviations above the mean. Using Chebyshev’s theorem, we see 
that at least .75, or at least 75%, of the observations must have values within two standard
deviations of the mean. Thus, at least 75% of the students must have scored between 60 and 80.

For the test scores between 58 and 82, we see that (58 − 70)/5 = −2.4 indicates 58 is
2.4 standard deviations below the mean and that (82 − 70)/5 = +2.4 indicates 82 is 2.4
standard deviations above the mean. Applying Chebyshev’s theorem with z = 2.4, we have

$$\left(1 - \frac{1}{z^2}\right) = \left(1 - \frac{1}{(2.4)^2}\right) = .826$$


At least 82.6% of the students must have test scores between 58 and 82. 
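
The two test score calculations can be reproduced with a small helper. This is a minimal sketch; the function name `chebyshev_lower_bound` is an illustrative choice and is not taken from the text or from any library.

```python
def chebyshev_lower_bound(z: float) -> float:
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z**2


mean_score, std_dev = 70, 5

bounds = {}
for low, high in [(60, 80), (58, 82)]:
    # Both intervals are symmetric about the mean, so one endpoint gives z
    z = (high - mean_score) / std_dev
    bounds[(low, high)] = chebyshev_lower_bound(z)
    print(f"{low} to {high}: z = {z:.1f}, at least {bounds[(low, high)]:.1%} of scores")
```

Because z need not be an integer, the same function handles z = 2 and z = 2.4 without change.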


Empirical Rule 


One of the advantages of Chebyshev’s theorem is that it applies to any data set regardless of
the shape of the distribution of the data. Indeed, it could be used with any of the distributions
in Figure 3.9. In many practical applications, however, data sets exhibit a symmetric
mound-shaped or bell-shaped distribution like the one shown in Figure 3.11. When the data are believed
to approximate this distribution, the empirical rule can be used to determine the percentage of
data values that must be within a specified number of standard deviations of the mean.


EMPIRICAL RULE 
For data having a bell-shaped distribution: 


• Approximately 68% of the data values will be within one standard deviation of
  the mean.
• Approximately 95% of the data values will be within two standard deviations of
  the mean.
• Almost all (approximately 99.7%) of the data values will be within three standard
  deviations of the mean.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




FIGURE 3.11 A Symmetric Mound-Shaped or Bell-Shaped Distribution 


For example, liquid detergent cartons are filled automatically on a production line. 
Filling weights frequently have a bell-shaped distribution. If the mean filling weight is 
16 ounces and the standard deviation is .25 ounces, we can use the empirical rule to draw 
the following conclusions: 


• Approximately 68% of the filled cartons will have weights between 15.75 and 16.25
  ounces (within one standard deviation of the mean).
• Approximately 95% of the filled cartons will have weights between 15.50 and 16.50
  ounces (within two standard deviations of the mean).
• Almost all (approximately 99.7%) filled cartons will have weights between 15.25
  and 16.75 ounces (within three standard deviations of the mean).
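
The carton intervals and the percentages attached to them can be checked numerically. The sketch below assumes, purely for illustration, that filling weights follow an exact normal distribution with mean 16 and standard deviation .25; the empirical rule itself only requires an approximately bell-shaped distribution.

```python
from statistics import NormalDist

mean_weight, std_dev = 16.0, 0.25          # ounces
filling = NormalDist(mu=mean_weight, sigma=std_dev)   # assumed exact normal model

coverage = {}
for k in (1, 2, 3):
    low = mean_weight - k * std_dev
    high = mean_weight + k * std_dev
    share = filling.cdf(high) - filling.cdf(low)       # proportion within k std devs
    coverage[k] = (low, high, share)
    print(f"within {k} std dev: {low:.2f} to {high:.2f} oz, about {share:.1%}")
```

The computed proportions round to the 68%, 95%, and 99.7% figures quoted in the empirical rule.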


Detecting Outliers 


Sometimes a data set will have one or more observations with unusually large or unusually 
small values. These extreme values are called outliers. Experienced statisticians take steps 
to identify outliers and then review each one carefully. An outlier may be a data value that 
has been incorrectly recorded. If so, it can be corrected before further analysis. An outlier 
may also be from an observation that was incorrectly included in the data set; if so, it can 
be removed. Finally, an outlier may be an unusual data value that has been recorded
correctly and belongs in the data set. In such cases it should remain.

Standardized values (z-scores) can be used to identify outliers. Recall that the empirical
rule allows us to conclude that for data with a bell-shaped distribution, almost all the
data values will be within three standard deviations of the mean. Hence, in using z-scores
to identify outliers, we recommend treating any data value with a z-score less than −3 or
greater than +3 as an outlier. Such data values can then be reviewed for accuracy and to
determine whether they belong in the data set.

Refer to the z-scores for the class size data in Table 3.5. The z-score of −1.50 shows the
fifth class size is farthest from the mean. However, this standardized value is well within
the −3 to +3 guideline for outliers. Thus, the z-scores do not indicate that outliers are
present in the class size data.

Another approach to identifying outliers is based upon the values of the first and third
quartiles (Q1 and Q3) and the interquartile range (IQR). Using this method, we first
compute the following lower and upper limits:


Lower Limit = Q1 − 1.5(IQR)
Upper Limit = Q3 + 1.5(IQR)

An observation is classified as an outlier if its value is less than the lower limit or greater
than the upper limit. For the monthly starting salary data shown in Table 3.1, Q1 = 5857.5,
Q3 = 6025, IQR = 167.5, and the lower and upper limits are


Lower Limit = Q1 − 1.5(IQR) = 5857.5 − 1.5(167.5) = 5606.25
Upper Limit = Q3 + 1.5(IQR) = 6025 + 1.5(167.5) = 6276.25


The approach that uses the first and third quartiles and the IQR to identify outliers does
not necessarily provide the same results as the approach based upon a z-score less than −3
or greater than +3. Either or both procedures may be used.


Looking at the data in Table 3.1 we see that there are no observations with a starting salary
less than the lower limit of 5606.25. But there is one starting salary, 6325, that is greater
than the upper limit of 6276.25. Thus, 6325 is considered to be an outlier using this
alternate approach to identifying outliers.


NOTES + COMMENTS


1. Chebyshev’s theorem is applicable for any data set and can be used to state the minimum
   number of data values that will be within a certain number of standard deviations of the
   mean. If the data are known to be approximately bell-shaped, more can be said. For
   instance, the empirical rule allows us to say that approximately 95% of the data values
   will be within two standard deviations of the mean; Chebyshev’s theorem allows us to
   conclude only that at least 75% of the data values will be in that interval.

2. Before analyzing a data set, statisticians usually make a variety of checks to ensure the
   validity of data. In a large study it is not uncommon for errors to be made in recording
   data values or in entering the values into a computer. Identifying outliers is one tool
   used to check the validity of the data.
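
The two outlier procedures can be compared directly on the twelve monthly starting salaries. The following is a sketch under one stated assumption: Python's `statistics.quantiles` with `method="exclusive"` is used for the quartiles because it reproduces the values Q1 = 5857.5 and Q3 = 6025 reported in the text; other software may use a different quartile convention.

```python
from statistics import mean, stdev, quantiles

# Monthly starting salaries ($), listed in ascending order
salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

# Approach 1: flag z-scores less than -3 or greater than +3
x_bar, s = mean(salaries), stdev(salaries)
z_outliers = [x for x in salaries if abs((x - x_bar) / s) > 3]

# Approach 2: flag values beyond 1.5(IQR) below Q1 or above Q3
q1, _, q3 = quantiles(salaries, n=4, method="exclusive")
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
iqr_outliers = [x for x in salaries if x < lower_limit or x > upper_limit]

print("z-score outliers:", z_outliers)
print("limits:", lower_limit, upper_limit)
print("IQR outliers:", iqr_outliers)
```

For these data the quartile-based limits flag 6325 while the z-score guideline flags nothing, which illustrates why the two procedures need not agree.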


EXERCISES 


Methods 

35. Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the z-score for 
each of the five observations. 

36. Consider a sample with a mean of 500 and a standard deviation of 100. What are the 
z-scores for the following data values: 520, 650, 500, 450, and 280? 

37. Consider a sample with a mean of 30 and a standard deviation of 5. Use Chebyshev’s
theorem to determine the percentage of the data within each of the following ranges:
a. 20 to 40
b. 15 to 45
c. 22 to 38
d. 18 to 42
e. 12 to 48

38. Suppose the data have a bell-shaped distribution with a mean of 30 and a standard 
deviation of 5. Use the empirical rule to determine the percentage of data within each 
of the following ranges: 
a. 20 to 40 
b. 15 to 45 
c. 25 to 35 




Applications 
39. Amount of Sleep per Night. The results of a national survey showed that on average, 
adults sleep 6.9 hours per night. Suppose that the standard deviation is 1.2 hours. 
a. Use Chebyshev’s theorem to calculate the percentage of individuals who sleep 
between 4.5 and 9.3 hours. 
b. Use Chebyshev’s theorem to calculate the percentage of individuals who sleep 
between 3.9 and 9.9 hours. 
c. Assume that the number of hours of sleep follows a bell-shaped distribution. Use 
the empirical rule to calculate the percentage of individuals who sleep between 4.5 
and 9.3 hours per day. How does this result compare to the value that you obtained 
using Chebyshev’s theorem in part (a)? 
40. Price per Gallon of Gasoline. Suppose that the mean retail price per gallon of
regular-grade gasoline is $3.43 with a standard deviation of $.10 and that the retail price per
gallon has a bell-shaped distribution.

a. What percentage of regular grade gasoline sells for between $3.33 and $3.53 per gallon? 
b. What percentage of regular grade gasoline sells for between $3.33 and $3.63 per gallon? 
c. What percentage of regular grade gasoline sells for more than $3.63 per gallon? 

41. GMAT Exam Scores. The Graduate Management Admission Test (GMAT) is a
standardized exam used by many universities as part of the assessment for admission to
graduate study in business. The average GMAT score is 547 (Magoosh website). Assume that
GMAT scores are bell-shaped with a standard deviation of 100.

a. What percentage of GMAT scores are 647 or higher? 
b. What percentage of GMAT scores are 747 or higher? 
c. What percentage of GMAT scores are between 447 and 547? 
d. What percentage of GMAT scores are between 347 and 647? 

42. Cost of Backyard Structures. Many families in California are using backyard
structures for home offices, art studios, and hobby areas as well as for additional storage.
Suppose that the mean price for a customized wooden, shingled backyard structure is
$3100. Assume that the standard deviation is $1200.

a. What is the z-score for a backyard structure costing $2300? 

b. What is the z-score for a backyard structure costing $4900? 

c. Interpret the z-scores in parts (a) and (b). Comment on whether either should be 
considered an outlier. 

d. If the cost for a backyard shed-office combination built in Albany, California, is 
$13,000, should this structure be considered an outlier? Explain. 

43. Best Places to Live. Each year Money magazine publishes a list of “Best Places
to Live in the United States.” These listings are based on affordability, educational
performance, convenience, safety, and livability. The list below shows the
median household income of Money magazine’s top city in each U.S. state for 2017
(Money magazine website).

DATA file: BestCities

                        Median Household                            Median Household
City                    Income ($)          City                    Income ($)
Pelham, AL              66,772              Bozeman, MT             49,303
Juneau, AK              84,101              Papillion, NE           TOA
Paradise Valley, AZ     138,192             Sparks, NV              54,230
Fayetteville, AR        40,835              cee NH                  66,872
Monterey Park, CA       57,419              North Arlington, NJ     73,885
Lone Tree, CO           116,761             Rio Rancho, NM          58,982
Manchester, CT          64,828              Valley Stream, NY       88,693
Hockessin, DE           115,124             Concord, NC             54,579
St. Augustine, FL       47,748              Dickinson, ND           71,866
Vinings, GA             T3108               Wooster, OH             43,054
Kapaa, HI               62,546              Mustang, OK             66,714
Meridian, ID            62,899              Beaverton, OR           58,785
Schaumburg, IL          73,824              Lower Merion, PA        117,438
Fishers, IN             87,043              Warwick, RI             63,414
Council Bluffs, IA      46,844              Mauldin, SC             57,480
Lenexa, KS              76,505              Rapid City, SD          47,788
Georgetown, KY          58,709              Franklin, TN            82,334
Bossier City, LA        47,051              Allen, TX               104,524
South Portland, ME      56,472              Orem, UT                SA SIS
Rockville, MD           100,158             Colchester, VT          69,181
Waltham, MA             75,106              Reston, VA              112,722
Farmington Hills, MI    71,154              Mercer Island, WA       128,484
Woodbury, MN            99,657              Morgantown, WV          38,060
Olive Branch, MS        62,958              New Berlin, WI          74,983
St. Peters, MO          57,728              Cheyenne, WY            56,529

a. Compute the mean and median for these household income data.
b. Compare the mean and median values for these data. What does this indicate about
the distribution of household income data?
c. Compute the range and standard deviation for these household income data.
d. Compute the first and third quartiles for these household income data.
e. Are there any outliers in these data? What does this suggest about the data?

44. NCAA Basketball Game Scores. A sample of 10 NCAA college basketball game 
scores provided the following data. 


DATA file: NCAA

Winning Team      Points   Losing Team      Points   Winning Margin
Arizona             90     Oregon             66          24
Duke                85     Georgetown         66          19
Florida State       75     Wake Forest        70           5
Kansas              78     Colorado           57          21
Kentucky            71     Notre Dame         63           8
Louisville          65     Tennessee          62           3
Oklahoma State      72     Texas              66           6
Purdue              76     Michigan State     70           6
Stanford            77     Southern Cal       67          10
Wisconsin           76     Illinois           56          20


a. Compute the mean and standard deviation for the points scored by the winning 
team. 

b. Assume that the points scored by the winning teams for all NCAA games follow a 
bell-shaped distribution. Using the mean and standard deviation found in part (a), 
estimate the percentage of all NCAA games in which the winning team scores 84 or 
more points. Estimate the percentage of NCAA games in which the winning team 
scores more than 90 points. 

c. Compute the mean and standard deviation for the winning margin. Do the data 
contain outliers? Explain. 

45. Apple iPads in Schools. The New York Times reported that Apple has unveiled a 
new iPad marketed specifically to school districts for use by students (The New 
York Times website). The 9.7-inch iPads will have faster processors and a cheaper 
price point in an effort to take market share away from Google Chromebooks in 
public school districts. Suppose that the following data represent the percentages 
of students currently using Apple iPads for a sample of 18 U.S. public school 
districts. 


15 22 12 21 26 18 42 29 64 20 15 22 18 24 27 
24 26 19 


a. Compute the mean and median percentage of students currently using Apple iPads.
b. Compare the first and third quartiles for these data.
c. Compute the range and interquartile range for these data.
d. Compute the variance and standard deviation for these data.
e. Are there any outliers in these data?
f. Based on your calculated values, what can we say about the percentage of students
using iPads in public school districts?

DATA file: iPads

3.4 Five-Number Summaries and Boxplots


Summary statistics and easy-to-draw graphs based on summary statistics can be used to
quickly summarize large quantities of data. In this section we show how five-number
summaries and boxplots can be developed to identify several characteristics of a data set.


Five-Number Summary 


In a five-number summary, five numbers are used to summarize the data: 


1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value


To illustrate the development of a five-number summary, we will use the monthly
starting salary data in Table 3.1. Arranging the data in ascending order, we obtain the following
results:


5710 5755 5850 5880 5880 5890 5920 5940 5950 6050 6130 6325


The smallest value is 5710 and the largest value is 6325. We showed how to compute the
quartiles (Q1 = 5857.5; Q2 = 5905; and Q3 = 6025) in Section 3.1. Thus, the five-number
summary for the monthly starting salary data is


5710 5857.5 5905 6025 6325 


The five-number summary indicates that the starting salaries in the sample are between 
5710 and 6325 and that the median or middle value is 5905. The first and third quartiles 
show that approximately 50% of the starting salaries are between 5857.5 and 6025. 
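
As a cross-check, the five-number summary can be computed programmatically. The helper name `five_number_summary` below is hypothetical, and the sketch assumes the exclusive quartile method of Python's `statistics.quantiles`, which gives the same quartiles as the procedure used in the text for these data.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest) using exclusive quartiles."""
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, q2, q3, max(data)


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

summary = five_number_summary(salaries)
print(summary)
```

The data do not need to be sorted before calling the function; `quantiles` sorts internally.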


Boxplot 


A boxplot is a graphical display of data based on a five-number summary. A key to the
development of a boxplot is the computation of the interquartile range, IQR = Q3 − Q1.
Figure 3.12 shows a boxplot for the monthly starting salary data. The steps used to
construct the boxplot follow.


1. A box is drawn with the ends of the box located at the first and third quartiles. For
   the salary data, Q1 = 5857.5 and Q3 = 6025. This box contains the middle 50% of
   the data.

2. A vertical line is drawn in the box at the location of the median (5905 for the
   salary data).

3. By using the interquartile range, IQR = Q3 − Q1, limits are located at 1.5(IQR)
   below Q1 and 1.5(IQR) above Q3. For the salary data, IQR = Q3 − Q1 = 6025 −
   5857.5 = 167.5. Thus, the limits are 5857.5 − 1.5(167.5) = 5606.25 and 6025 +
   1.5(167.5) = 6276.25. Data outside these limits are considered outliers.

4. The horizontal lines extending from each end of the box in Figure 3.12 are called
   whiskers. The whiskers are drawn from the ends of the box to the smallest and
   largest values inside the limits computed in step 3. Thus, the whiskers end at salary
   values of 5710 and 6130.

5. Finally, the location of each outlier is shown with a small dot. In Figure 3.12 we see
   one outlier, 6325.


Boxplots provide another way to identify outliers, but they do not necessarily identify the
same values as those with a z-score less than −3 or greater than +3. Either or both
procedures may be used.


In Figure 3.12 we included lines showing the location of the upper and lower limits. 
These lines were drawn to show how the limits are computed and where they are 
located. Although the limits are always computed, generally they are not drawn on the 
boxplots. 
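
The numbers behind a boxplot (box ends, limits, whisker ends, and outliers) can be collected in one place before any drawing is done. The function `boxplot_stats` below is a hypothetical helper written for this illustration; it is not an Excel feature, and it assumes the exclusive quartile method so that the results match the steps above.

```python
from statistics import quantiles


def boxplot_stats(data):
    """Compute the quantities needed to draw a boxplot by hand."""
    q1, median, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations that lie inside the limits
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    return {
        "box": (q1, median, q3),
        "limits": (lower_limit, upper_limit),
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in data if x < lower_limit or x > upper_limit],
    }


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

stats = boxplot_stats(salaries)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that the whiskers end at observed data values (5710 and 6130), not at the limits themselves.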
In Excel, a boxplot is referred to as a box and whisker plot.

DATA file: StartingSalaries

FIGURE 3.12 Boxplot of the Monthly Starting Salary Data with Lines Showing
the Lower and Upper Limits

Using Excel to Construct a Boxplot


We can use Excel’s Insert Statistic Chart to construct a boxplot of the monthly starting 
salary data as outlined below. 


Enter/Access Data: Open the file StartingSalaries. The data are in cells B2:B13. 


Apply Tools: The following steps describe how to use Excel’s Insert Statistic Chart to
construct a boxplot of the monthly starting salary data.


Step 1. Select cells in the data set (B2:B13) 

Step 2. Click the Insert tab on the Ribbon 

Step 3. In the Charts group, click Insert Statistic Chart and then click Box and
Whisker; the boxplot appears in the spreadsheet


Editing Options: 


Step 1. Click on Chart Title and press the Delete key 
Step 2. Click on the 1 below the horizontal axis and press the Delete key 
Step 3. Click on the Chart Elements button (located next to the top right corner of
the chart)
Step 4. When the list of chart elements appears: 
Deselect Gridlines to remove the horizontal gridlines from the chart 
Click Axis Titles to create placeholders for the axis titles 
Click on the horizontal Axis Title and press the Delete key 
Click on the vertical Axis Title placeholder and replace it with Monthly 
Starting Salary ($) 
Step 5. Right-click on the vertical axis, select Format Axis... 
Step 6. In the Format Axis task pane, select Tick Marks, and from the drop-down 
Major type menu select Inside 


Figure 3.13 shows the resulting boxplot. 


Comparative Analysis Using Boxplots 


Boxplots can also be used to provide a graphical summary of two or more groups and 
facilitate visual comparisons among the groups. For example, suppose the placement office 
decided to conduct a follow-up study to compare monthly starting salaries by the
graduate’s major: accounting, finance, information systems, management, and marketing. The
major and starting salary data for a new sample of 111 recent business school graduates are 
shown in the file MajorSalaries. 
FIGURE 3.13 The Edited Boxplot Created in Excel for the Monthly Starting
Salary Data

Using Excel to Construct a Comparative Analysis Using Boxplots


We can use Excel’s Insert Statistic Chart to construct a comparative boxplot of data on the
monthly starting salary by major as outlined below.


DATA file: MajorSalaries

Enter/Access Data: Open the file MajorSalaries. The data are in cells A2:B112.


Apply Tools: The following steps describe how to use Excel’s Insert Statistic Chart to
construct boxplots of monthly salary by major.


Step 1. Select cells in the data set (A2:B112). 

Step 2. Click the Insert tab on the Ribbon 

Step 3. In the Charts group click Insert Statistic Chart and then click Box and
Whisker; the boxplot appears in the spreadsheet


Editing Options: 


Step 1. Click on Chart Title and replace it with Comparative Analysis of Monthly
Starting Salary by Major
Step 2. To put the majors in alphabetical order from left to right: 
Select cells in the data set (A2:B112) 
Select the Data tab on the Ribbon 
Select Sort from the Sort & Filter group 
From Sort by drop-down menu in the Sort dialog box, select Major 
From the Order drop-down menu in the Sort dialog box, select A to Z 
Click OK 
Step 3. Select the chart by clicking anywhere on the chart. Click on the Chart Elements
button (located next to the top right corner of the chart)
Step 4. When the list of chart elements appears:
Deselect Gridlines to remove the horizontal gridlines from the chart
Click Axis Titles to create placeholders for the axis titles
Click on the horizontal Axis Title placeholder and replace it with Major
Click on the vertical Axis Title placeholder and replace it with Monthly
Starting Salary ($)
Step 5. Right-click on the vertical axis, select Format Axis...
Step 6. In the Format Axis task pane
Select Axis Options and enter 4000 for Minimum
Select Tick Marks, and from the drop-down Major type, select Inside


FIGURE 3.14 The Edited Comparative Boxplots Created in Excel for the Monthly
Starting Salary by Major Data


Figure 3.14 shows the resulting boxplot comparative analysis.
What interpretations can you make from the boxplots in Figure 3.14? Specifically, we 
note the following: 


• The higher salaries are in accounting; the lower salaries are in management and
  marketing.
• Based on the medians, accounting and information systems have similar and higher
  median salaries than the other majors. Finance is next, with marketing and
  management showing lower median salaries.
• High salary outliers exist for accounting, finance, and marketing majors.


Perhaps you can see additional interpretations based on these boxplots. 


NOTES + COMMENTS 


1. There are several options with Excel's Box and Whisker chart. To invoke these options,
   right-click on the box part of the chart, select Format Data Series..., and the Format
   Data Series task pane will appear. This allows you to control what appears in the chart,
   for example, whether or not to show the mean marker, markers for outliers, and markers
   for all points. In the case of a comparative chart, you can control the gaps between
   categories and create a line connecting the means of the different categories.

2. In the Format Data Series task pane, there are two options for how quartiles are
   calculated: Inclusive median and Exclusive median. The default is Exclusive median;
   this option is consistent with the approach for calculating quartiles discussed in this
   text. We recommend that you leave this option at its default value of Exclusive median.
EXERCISES
Methods 
46. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Provide the 
five-number summary for the data. 
47. Show the boxplot for the data in exercise 46. 
48. Show the five-number summary and the boxplot for the following data: 5, 15, 18, 10, 
8, 12, 16, 10, 6. 
49. A data set has a first quartile of 42 and a third quartile of 50. Compute the lower and upper 
limits for the corresponding boxplot. Should a data value of 65 be considered an outlier? 
Applications 
50. Naples Half-Marathon Times. Naples, Florida, hosts a half-marathon (13.1-mile 
running race) in January each year. The event attracts top runners from throughout 
the United States as well as from around the world. In the race results shown below, 
22 men and 31 women entered the 19-24 age class. Finish times in minutes are as 
follows. Times are shown in order of finish. 
DATA file: Runners

Finish   Men      Women      Finish   Men      Women      Finish   Men      Women
1        65.30    109.03     11       109.05   123.88     21       143.83   1675
2        66.27    122        12       110.23   125.78     22       148.70   138.20
3        66.52    111.65     13       1220     129152     23                139.00
4        66.85    Tes        14       152      129.87     24                147.18
5        70.87    114.38     15       120.95   130.72     25                147.35
6        87.18    S233       16       127.98   131.67     26                147.50
7        96.45    VANS       17       128.40   132.03     27                147.75
8        98.52    122.08     18       130.90   15320      28                153.88
9        100.52   122.48     19       131.80   133.50     29                154.83
10       108.18   122.62     20       138.63   1396:57    30                189.27
                                                          31                189.28
a. George Towett of Marietta, Georgia, finished in first place for the men and Lauren 
Wald of Gainesville, Florida, finished in first place for the women. Compare the 
first-place finish times for men and women. If the 53 men and women runners had 
competed as one group, in what place would Lauren have finished? 
b. What is the median time for men and women runners? Compare men and women 
runners based on their median times. 
c. Provide a five-number summary for both the men and the women. 
d. Are there outliers in either group? 
e. Show the boxplots for the two groups. Did men or women have the most variation 
in finish times? Explain. 
51. Pharmaceutical Company Sales. Annual sales, in millions of dollars, for 21
pharmaceutical companies follow.

DATA file: PharmacySales

8,408    1,374    1,872    8,879    2,459    11,413
608      14,138   6,452    1,850    2,818    1,356
10,498   7,478    4,019    4,341    739      2,127
3,653    5,794    8,305

a. Provide a five-number summary.
b. Compute the lower and upper limits.
c. Do the data contain any outliers?
d. Johnson & Johnson’s sales are the largest on the list at $14,138 million. Suppose
a data entry error (a transposition) had been made and the sales had been entered
as $41,138 million. Would the method of detecting outliers in part (c) identify this
problem and allow for correction of the data entry error?
e. Show a boxplot.

52. Cell Phone Companies Customer Satisfaction. Consumer Reports provided over- 
all customer satisfaction scores for AT&T, Sprint, T-Mobile, and Verizon cell phone 
services in major metropolitan areas throughout the United States. The rating for 
each service reflects the overall customer satisfaction considering a variety of factors 
such as cost, connectivity problems, dropped calls, static interference, and customer 
support. A satisfaction scale from 0 to 100 was used with 0 indicating completely 
dissatisfied and 100 indicating completely satisfied. The ratings for the four cell phone 
services in 20 metropolitan areas are as shown. 


DATA file: CellService

Metropolitan Area   AT&T   Sprint   T-Mobile   Verizon
Atlanta             70     66       7A         Tg
Boston              69     64       74         76
Chicago             71     65       70         7
Dallas              75     65       74         78
Denver              7      67       73         HY
Detroit             Us)    65       77         79
Jacksonville        73     64       75         81
Las Vegas           We     68       74         81
Los Angeles         66     65       68         78
Miami               68     69       73         80
Minneapolis         68     66       75         77
Philadelphia        72     66       71         78
Phoenix             68     66       76         81
San Antonio         75     65       75         80
San Diego           69     68       V2         79
San Francisco       66     69       73         75
Seattle             68     67       74         77
St. Louis           74     66       74         79
Tampa               73     63       73         HS)
Washington          72     68       71         76

a. Consider T-Mobile first. What is the median rating? 
b. Develop a five-number summary for the T-Mobile service. 
c. Are there outliers for T-Mobile? Explain. 
d. Repeat parts (b) and (c) for the other three cell phone services. 
e. Show the boxplots for the four cell phone services on one graph. Discuss 
what a comparison of the boxplots tells about the four services. Which service 
did Consumer Reports recommend as being best in terms of overall customer 
satisfaction? 
53. Most Admired Companies. Fortune magazine’s list of the world’s most admired
companies for 2014 is provided in the data contained in the file AdmiredCompanies
(Fortune magazine website). The data in the column labeled “Return” shows the
one-year total return (%) for the top-ranked 50 companies. For the same time period
the S&P average return was 18.4%.

DATA file: AdmiredCompanies

a. Compute the median return for the top-ranked 50 companies. 

b. What percentage of the top-ranked 50 companies had a one-year return greater than 
the S&P average return? 

c. Develop the five-number summary for the data. 

d. Are there any outliers? 

e. Develop a boxplot for the one-year total return. 
54. U.S. Border Crossings. The Bureau of Transportation Statistics keeps track of all
border crossings through ports of entry along the U.S.-Canadian and U.S.-Mexican
borders. The data contained in the file BorderCrossings show the most recently
published figures for the number of personal vehicle crossings (rounded to the nearest
1000) at the 50 busiest ports of entry during the month of August (U.S. Department
of Transportation website).

a. What are the mean and median number of crossings for these ports of entry?
b. What are the first and third quartiles?
c. Provide a five-number summary.
d. Do the data contain any outliers? Show a boxplot.


3.5 Measures of Association Between Two Variables 


Thus far we have examined numerical methods used to summarize the data for one variable 
at a time. Often a manager or decision maker is interested in the relationship between two 
variables. In this section we present covariance and correlation as descriptive measures of 
the relationship between two variables. 

We begin by reconsidering the application concerning an electronics store in San Francisco 
as presented in Section 2.4. The store’s manager wants to determine the relationship between 
the number of weekend television commercials shown and the sales at the store during the 
following week. Sample data with sales expressed in hundreds of dollars are provided in 
Table 3.6. It shows 10 observations (n = 10), one for each week. The scatter diagram in 
Figure 3.15 shows a positive relationship, with higher sales (y) associated with a greater num- 
ber of commercials (x). In fact, the scatter diagram suggests that a straight line could be used 
as an approximation of the relationship. In the following discussion, we introduce covariance 
as a descriptive measure of the linear association between two variables. 


Covariance 


For a sample of size n with the observations (x_1, y_1), (x_2, y_2), and so on, the sample
covariance is defined as follows:


SAMPLE COVARIANCE

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \tag{3.13}$$


TABLE 3.6

        Number of Commercials    Sales ($100s)
Week    x                        y
1       2                        50
2       5                        57
3       1                        41
4       3                        54
5       4                        54
6       1                        38
7       5                        63
8       3                        48
9       4                        59
10      2                        46
FIGURE 3.15 Scatter Diagram for the San Francisco Electronics Store
(vertical axis: Sales ($100s); horizontal axis: Number of Commercials)


This formula pairs each x_i with a y_i. We then sum the products obtained by multiplying the
deviation of each x_i from its sample mean x̄ by the deviation of the corresponding y_i from
its sample mean ȳ; this sum is then divided by n − 1.

To measure the strength of the linear relationship between the number of commercials 
x and the sales volume y in the San Francisco electronics store problem, we use equa- 
tion (3.13) to compute the sample covariance. The calculations in Table 3.7 show the 
computation of X(x; — x)(y; — y). Note that x = 30/10 = 3 and y = 510/10 = 51. Using 
equation (3.13), we obtain a sample covariance of 


$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$$
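TABLE 3.7 Calculations for the Sample Covariance

          $x_i$    $y_i$    $x_i - \bar{x}$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
            2       50           -1                 -1                      1
            5       57            2                  6                     12
            1       41           -2                -10                     20
            3       54            0                  3                      0
            4       54            1                  3                      3
            1       38           -2                -13                     26
            5       63            2                 12                     24
            3       48            0                 -3                      0
            4       59            1                  8                      8
            2       46           -1                 -5                      5
Totals     30      510            0                  0                     99

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{10 - 1} = 11$$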
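The calculation in Table 3.7 can also be sketched in a few lines of Python. This is only an illustration of equation (3.13) applied to the Table 3.6 data; the function name `sample_covariance` is our own choice and is not part of the text or of Excel.

```python
# Weekly data from Table 3.6: commercials shown (x) and sales in $100s (y).
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]


def sample_covariance(x, y):
    """Equation (3.13): sum of the deviation products divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)


s_xy = sample_covariance(commercials, sales)
print(s_xy)  # 11.0
```

The list `products` inside the function corresponds to the last column of Table 3.7, whose total is 99.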
The formula for computing the covariance of a population of size $N$ is similar to
equation (3.13), but we use different notation to indicate that we are working with the entire
population.


POPULATION COVARIANCE

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \tag{3.14}$$


In equation (3.14) we use the notation $\mu_x$ for the population mean of the variable $x$ and $\mu_y$
for the population mean of the variable $y$. The population covariance $\sigma_{xy}$ is defined for a
population of size $N$.
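As a hedged illustration of how equations (3.13) and (3.14) differ, suppose, purely for the sake of the arithmetic, that the 10 weeks in Table 3.6 made up the entire population rather than a sample. This assumption is ours and is not made in the text. The means would be unchanged ($\mu_x = 3$, $\mu_y = 51$) and the sum of the deviation products would still be 99, so only the divisor changes:

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} = \frac{99}{10} = 9.9$$

The population formula divides by $N = 10$, whereas the sample formula divides by $n - 1 = 9$ and gives 11.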


Interpretation of the Covariance 


To aid in the interpretation of the sample covariance, consider Figure 3.16. It is the same
as the scatter diagram of Figure 3.15, with a vertical dashed line at $\bar{x} = 3$ and a horizontal
dashed line at $\bar{y} = 51$. The lines divide the graph into four quadrants. Points in quadrant
I correspond to $x_i$ greater than $\bar{x}$ and $y_i$ greater than $\bar{y}$, points in quadrant II correspond to
$x_i$ less than $\bar{x}$ and $y_i$ greater than $\bar{y}$, and so on. Thus, the value of $(x_i - \bar{x})(y_i - \bar{y})$ must be
positive for points in quadrant I, negative for points in quadrant II, positive for points in
quadrant III, and negative for points in quadrant IV.


If the value of $s_{xy}$ is positive, the points with the greatest influence on $s_{xy}$ must be in
quadrants I and III. Hence, a positive value for $s_{xy}$ indicates a positive linear association
between $x$ and $y$; that is, as the value of $x$ increases, the value of $y$ increases. If the value of
$s_{xy}$ is negative, however, the points with the greatest influence on $s_{xy}$ are in quadrants II and
IV. Hence, a negative value for $s_{xy}$ indicates a negative linear association between $x$ and $y$;
that is, as the value of $x$ increases, the value of $y$ decreases. Finally, if the points are evenly
distributed across all four quadrants, the value of $s_{xy}$ will be close to zero, indicating no
linear association between $x$ and $y$. Figure 3.17 shows the values of $s_{xy}$ that can be expected
with three different types of scatter diagrams.

The covariance is a
measure of the linear
association between two
variables.
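The quadrant argument above can be checked numerically. The sketch below, which is our own illustration rather than part of the text, assigns each point of the electronics store data to a quadrant around $(\bar{x}, \bar{y}) = (3, 51)$ and totals the deviation products quadrant by quadrant. Points that fall exactly on a dashed line are counted separately because their product is zero.

```python
# Data from Table 3.6.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

x_bar = sum(commercials) / len(commercials)  # 3.0
y_bar = sum(sales) / len(sales)              # 51.0


def quadrant(dx, dy):
    """Label a point by the signs of its deviations from the means."""
    if dx == 0 or dy == 0:
        return "on a line"
    if dx > 0 and dy > 0:
        return "I"
    if dx < 0 and dy > 0:
        return "II"
    if dx < 0 and dy < 0:
        return "III"
    return "IV"


# Sum of (x_i - x_bar)(y_i - y_bar) contributed by each quadrant.
totals = {"I": 0.0, "II": 0.0, "III": 0.0, "IV": 0.0, "on a line": 0.0}
for x, y in zip(commercials, sales):
    dx, dy = x - x_bar, y - y_bar
    totals[quadrant(dx, dy)] += dx * dy

s_xy = sum(totals.values()) / (len(commercials) - 1)
print(totals)
print(s_xy)
```

For these data every nonzero contribution comes from quadrants I and III, which is why the sample covariance is positive.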


FIGURE 3.16 Partitioned Scatter Diagram for the San Francisco Electronics Store
(Horizontal axis: Number of Commercials; vertical axis: Sales ($100s); dashed lines at $\bar{x} = 3$ and $\bar{y} = 51$ divide the plot into quadrants I through IV.)


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




FIGURE 3.17 Interpretation of Sample Covariance
(Three scatter diagram panels. Top: $s_{xy}$ positive, x and y are positively linearly related. Middle: $s_{xy}$ approximately 0, x and y are not linearly related. Bottom: $s_{xy}$ negative, x and y are negatively linearly related.)

Referring again to Figure 3.16, we see that the scatter diagram for the electronics store
follows the pattern in the top panel of Figure 3.17. As we should expect, the value of the
sample covariance indicates a positive linear relationship with $s_{xy} = 11$.

From the preceding discussion, it might appear that a large positive value for the
covariance indicates a strong positive linear relationship and that a large negative value
indicates a strong negative linear relationship. However, one problem with using covariance
as a measure of the strength of the linear relationship is that the value of the covariance
depends on the units of measurement for $x$ and $y$. For example, suppose we are interested
in the relationship between height $x$ and weight $y$ for individuals. Clearly the strength of
the relationship should be the same whether we measure height in feet or inches.
Measuring the height in inches, however, gives us much larger numerical values for $(x_i - \bar{x})$ than
when we measure height in feet. Thus, with height measured in inches, we would obtain
a larger value for the numerator $\sum (x_i - \bar{x})(y_i - \bar{y})$ in equation (3.13), and hence a larger
covariance, when in fact the relationship does not change. A measure of the relationship
between two variables that is not affected by the units of measurement for $x$ and $y$ is the
correlation coefficient.
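The effect of units can be seen with the electronics store data instead of heights and weights. In the sketch below (our own illustration; the helper names are assumptions), sales are expressed first in hundreds of dollars, as in Table 3.6, and then in dollars. The covariance grows by a factor of 100, while the correlation coefficient, defined formally in the next subsection as the covariance divided by the product of the two standard deviations, stays the same.

```python
import math

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales_hundreds = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # Table 3.6 units
sales_dollars = [100 * s for s in sales_hundreds]          # same sales, in dollars


def sample_covariance(x, y):
    """Sum of deviation products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


cov_hundreds = sample_covariance(commercials, sales_hundreds)
cov_dollars = sample_covariance(commercials, sales_dollars)
r_hundreds = sample_correlation(commercials, sales_hundreds)
r_dollars = sample_correlation(commercials, sales_dollars)

print(cov_hundreds, cov_dollars)  # the covariance changes with the units
print(r_hundreds, r_dollars)      # the correlation does not
```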


Correlation Coefficient 


For sample data, the Pearson product moment correlation coefficient is defined as follows.


PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT: SAMPLE DATA

$$r_{xy} = \frac{s_{xy}}{s_x s_y} \tag{3.15}$$

where

$r_{xy}$ = sample correlation coefficient
$s_{xy}$ = sample covariance
$s_x$ = sample standard deviation of $x$
$s_y$ = sample standard deviation of $y$


Equation (3.15) shows that the Pearson product moment correlation coefficient for 
sample data (commonly referred to more simply as the sample correlation coefficient) is 
computed by dividing the sample covariance by the product of the sample standard devia- 
tion of x and the sample standard deviation of y. 

Let us now compute the sample correlation coefficient for the San Francisco electronics
store. Using the data in Table 3.6, we can compute the sample standard deviations for the
two variables:

$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{20}{9}} = 1.49$$

$$s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{566}{9}} = 7.93$$

Now, because $s_{xy} = 11$, the sample correlation coefficient equals

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{11}{(1.49)(7.93)} = .93$$
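The same three steps (covariance, standard deviations, then their ratio) can be traced in Python. This is a sketch using the standard library's `statistics.stdev`, which computes the sample standard deviation with the $n - 1$ divisor; the variable names are our own.

```python
import statistics

# Data from Table 3.6.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(commercials)
x_bar = statistics.mean(commercials)
y_bar = statistics.mean(sales)

# Step 1: sample covariance, equation (3.13).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(commercials, sales)) / (n - 1)

# Step 2: sample standard deviations of x and y.
s_x = statistics.stdev(commercials)
s_y = statistics.stdev(sales)

# Step 3: sample correlation coefficient, equation (3.15).
r_xy = s_xy / (s_x * s_y)

print(round(s_x, 2), round(s_y, 2), round(r_xy, 2))  # 1.49 7.93 0.93
```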


The formula for computing the correlation coefficient for a population, denoted by the
Greek lowercase letter rho, or $\rho_{xy}$, follows.

PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT: POPULATION DATA

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \tag{3.16}$$

where

$\rho_{xy}$ = population correlation coefficient
$\sigma_{xy}$ = population covariance
$\sigma_x$ = population standard deviation for $x$
$\sigma_y$ = population standard deviation for $y$


The sample correlation coefficient $r_{xy}$ provides an estimate of the population correlation
coefficient $\rho_{xy}$.


Interpretation of the Correlation Coefficient 


First let us consider a simple example that illustrates the concept of a perfect positive linear 
relationship. The scatter diagram in Figure 3.18 depicts the relationship between x and y 
based on the following sample data. 


$x_i$    $y_i$
  5       10
 10       30
 15       50


The straight line drawn through each of the three points shows a perfect linear
relationship between $x$ and $y$. In order to apply equation (3.15) to compute the sample correlation
we must first compute $s_{xy}$, $s_x$, and $s_y$. Some of the computations are shown in Table 3.8.

FIGURE 3.18 Scatter Diagram Depicting a Perfect Positive Linear Relationship

TABLE 3.8 Computations Used in Calculating the Sample Correlation Coefficient

          $x_i$    $y_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $y_i - \bar{y}$    $(y_i - \bar{y})^2$    $(x_i - \bar{x})(y_i - \bar{y})$
            5       10           -5                  25                 -20                 400                     100
           10       30            0                   0                   0                   0                       0
           15       50            5                  25                  20                 400                     100
Totals     30       90            0                  50                   0                 800                     200
        $\bar{x} = 10$  $\bar{y} = 30$


The correlation coefficient 
ranges from —1 to +1. 
Values close to —1 or +1 
indicate a strong linear 
relationship. The closer the 
correlation is to zero, the 
weaker the relationship. 




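Using the results in this table, we find

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{200}{2} = 100$$

$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{50}{2}} = 5$$

$$s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{800}{2}} = 20$$

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{100}{5(20)} = 1$$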

Thus, we see that the value of the sample correlation coefficient is 1. 

In general, it can be shown that if all the points in a data set fall on a positively
sloped straight line, the value of the sample correlation coefficient is +1; that is, a
sample correlation coefficient of +1 corresponds to a perfect positive linear relationship
between $x$ and $y$. Moreover, if the points in the data set fall on a straight line having
negative slope, the value of the sample correlation coefficient is −1; that is, a sample
correlation coefficient of −1 corresponds to a perfect negative linear relationship
between $x$ and $y$.
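Both limiting cases can be verified with a short sketch. The first data set is the one from Table 3.8; the second, `y_down`, is a hypothetical set we introduce here by reversing the y values so that the three points fall on a negatively sloped line.

```python
import math


def sample_correlation(x, y):
    """Equation (3.15): s_xy divided by the product of s_x and s_y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


x = [5, 10, 15]
y_up = [10, 30, 50]    # Table 3.8 data: points on a positively sloped line
y_down = [50, 30, 10]  # hypothetical data: points on a negatively sloped line

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
print(r_up, r_down)
```

With `y_up` the function reproduces the hand calculation above (covariance 100, standard deviations 5 and 20); with `y_down` only the sign of the covariance changes.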

Let us now suppose that a certain data set indicates a positive linear relationship
between $x$ and $y$ but that the relationship is not perfect. The value of $r_{xy}$ will be less
than 1, indicating that the points in the scatter diagram are not all on a straight line.
As the points deviate more and more from a perfect positive linear relationship, the
value of $r_{xy}$ becomes smaller and smaller. A value of $r_{xy}$ equal to zero indicates no
linear relationship between $x$ and $y$, and values of $r_{xy}$ near zero indicate a weak linear
relationship.

For the data involving the San Francisco electronics store, $r_{xy} = .93$. Therefore, we
conclude that a strong positive linear relationship occurs between the number of commercials
and sales. More specifically, an increase in the number of commercials is associated with
an increase in sales.

In closing, we note that correlation provides a measure of linear association and not 
necessarily causation. A high correlation between two variables does not mean that 
changes in one variable will cause changes in the other variable. For example, we may 
find that the quality rating and the typical meal price of restaurants are positively cor- 
related. However, simply increasing the meal price at a restaurant will not cause the 
quality rating to increase. 




FIGURE 3.19 Using Excel to Compute the Covariance and Correlation Coefficient
(Worksheet with Week in column A, No. of Commercials in column B, and Sales ($100s) in column C. Formula worksheet: Sample Covariance =COVARIANCE.S(B2:B11,C2:C11) and Sample Correlation =CORREL(B2:B11,C2:C11). Value worksheet: Sample Covariance 11 and Sample Correlation 0.93.)


Using Excel to Compute the Sample Covariance
and Sample Correlation Coefficient


Excel provides functions that can be used to compute the covariance and correlation coeffi- 
cient. We illustrate the use of these functions by computing the sample covariance and the 
sample correlation coefficient for the San Francisco electronics data in Table 3.6. Refer to 
Figure 3.19 as we describe the tasks involved. The formula worksheet is in the background; 
the value worksheet is in the foreground. 


Enter/Access Data: Open the file Electronics. The data are in cells B2:C11 and labels are
in column A and cells B1:C1.


Enter Functions and Formulas: Excel’s COVARIANCE.S function can be used to
compute the sample covariance by entering the following formula in cell F2:


=COVARIANCE.S(B2:B11,C2:C11) 


Similarly, the formula =CORREL(B2:B11,C2:C11) is entered in cell F3 to compute the 
sample correlation coefficient. 

The labels in cells E2:E3 identify the output. Note that the sample covariance (11) and 
the sample correlation coefficient (.93) are the same as we computed earlier using the 
definitions. 


NOTES + COMMENTS


1. Because the correlation coefficient measures only the strength of the linear relationship
between two quantitative variables, it is possible for the correlation coefficient
to be near zero, suggesting no linear relationship, when
the relationship between the two variables is nonlinear. For
example, the following scatter diagram shows the relationship
between the amount spent by a small retail store for
environmental control (heating and cooling) and the daily
high outside temperature over 100 days.

(Scatter diagram. Horizontal axis: Outside Temperature (Fahrenheit), 0 to 100; vertical axis: $ Spent on Environmental Control.)

The sample correlation coefficient for these data
is $r_{xy} = -.007$ and indicates there is no linear relationship
between the two variables. However, the scatter
diagram provides strong visual evidence of a nonlinear
relationship. That is, we can see that as the daily high
outside temperature increases, the money spent on
environmental control first decreases as less heating is
required and then increases as greater cooling is required.

2. While the correlation coefficient is useful in assessing the
relationship between two quantitative variables, other measures,
such as the Spearman rank-correlation coefficient,
can be used to assess a relationship between two variables
when at least one of the variables is nominal or ordinal. We
discuss the use of the Spearman rank-correlation coefficient
in Chapter 18.
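The point of note 1 can be reproduced with a small sketch. The data below are hypothetical and are not the store's 100 days of observations: spending is made to depend on temperature exactly through a U-shaped rule centered at 50 degrees, chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical data: daily high temperature and a U-shaped spending pattern.
temperature = list(range(0, 101, 10))             # 0, 10, ..., 100 degrees
spending = [(t - 50) ** 2 for t in temperature]   # lowest at 50, higher at both extremes


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


r = sample_correlation(temperature, spending)
print(r)
```

Spending is completely determined by temperature here, yet the positive and negative deviation products cancel, so the correlation coefficient is zero. A scatter diagram would reveal the relationship that the coefficient misses.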


EXERCISES 


Methods 
55. Five observations taken for two variables follow.


$x_i$ |  4   6  11   3  16
$y_i$ | 50  50  40  60  30


a. Develop a scatter diagram with x on the horizontal axis. 
b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 
c. Compute and interpret the sample covariance. 
d. Compute and interpret the sample correlation coefficient. 
56. Five observations taken for two variables follow. 


aje n 5 a -3 
„yle 9 6 17 2 


a. Develop a scatter diagram for these data.
b. What does the scatter diagram indicate about a relationship between x and y?
c. Compute and interpret the sample covariance.
d. Compute and interpret the sample correlation coefficient.


Applications 
57. Stock Price Comparison. The file StockComparison contains monthly adjusted stock
prices for technology company Apple, Inc., and consumer-goods company Procter &
Gamble (P&G) from 2013 to 2018.
a. Develop a scatter diagram with Apple stock price on the horizontal axis and P&G 
stock price on the vertical axis. 
b. What appears to be the relationship between these two stock prices? 
c. Compute and interpret the sample covariance. 
d. Compute the sample correlation coefficient. What does this value indicate about the 
relationship between the stock price of Apple and the stock price of P&G? 
58. Driving Speed and Fuel Efficiency. A department of transportation’s study on driving 
speed and miles per gallon for midsize automobiles resulted in the following data: 


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Sr aE EET a a a 

Miles per Gallon 28 25 25 23 30 32 21 35 26 25 
Compute and interpret the sample correlation coefficient. 

59. Smoke Detector Use and Death Rates. Over the past 40 years, the percentage of homes 
in the United States with smoke detectors has risen steadily and has plateaued at about 
96% as of 2015 (National Fire Protection Association website). With this increase in the 
use of home smoke detectors, what has happened to the death rate from home fires? The 
file SmokeDetectors contains 17 years of data on the estimated percentage of homes with
smoke detectors and the estimated home fire deaths per million of population. 

a. Do you expect a positive or negative relationship between smoke detector use and 
deaths from home fires? Why or why not? 

b. Compute and report the correlation coefficient. Is there a positive or negative cor- 
relation between smoke detector use and deaths from home fires? Comment. 

c. Show a scatter plot of the death rate per million of population and the percentage of 
homes with smoke detectors. 

60. Stock Market Indexes Comparison. The Russell 1000 is a stock market index con- 
sisting of the largest U.S. companies. The Dow Jones Industrial Average is based on 30 
large companies. The file Russell gives the annual percentage returns for each of these 
stock indexes for the years 1988 to 2012 (1Stock1 website). 

a. Plot these percentage returns using a scatter plot. 

b. Compute the sample mean and standard deviation for each index. 
c. Compute the sample correlation. 

d. Discuss similarities and differences in these two indexes. 

61. Best Private Colleges. A random sample of 30 colleges from Kiplinger’s list of the best 
values in private college provided the data shown in the file BestPrivateColleges (Kiplinger 
website). The variable Admit Rate (%) shows the percentage of students that applied to the 
college and were admitted, and the variable 4-yr Grad. Rate (%) shows the percentage of 
students that were admitted and graduated in four years. 

a. Develop a scatter diagram with Admit Rate (%) as the independent variable. What 
does the scatter diagram indicate about the relationship between the two variables? 

b. Compute the sample correlation coefficient. What does the value of the sample 
correlation coefficient indicate about the relationship between the Admit Rate (%) 
and the 4-yr Grad. Rate (%)? 


3.6 Data Dashboards: Adding Numerical Measures 
to Improve Effectiveness 


In Section 2.5 we provided an introduction to data visualization, a term used to describe the 
use of graphical displays to summarize and present information about a data set. The goal 
of data visualization is to communicate key information about the data as effectively and 
clearly as possible. One of the most widely used data visualization tools is a data dashboard, 
a set of visual displays that organizes and presents information that is used to monitor the 
performance of a company or organization in a manner that is easy to read, understand, and 
interpret. In this section we extend the discussion of data dashboards to show how the addi- 
tion of numerical measures can improve the overall effectiveness of the display. 

The addition of numerical measures, such as the mean and standard deviation of key per- 
formance indicators (KPIs), to a data dashboard is critical because numerical measures often 
provide benchmarks or goals by which KPIs are evaluated. In addition, graphical displays 
that include numerical measures as components of the display are also frequently included 
in data dashboards. We must keep in mind that the purpose of a data dashboard is to provide 
information on the KPIs in a manner that is easy to read, understand, and interpret. Adding 
numerical measures and graphs that utilize numerical measures can help us accomplish 
these objectives. 
FIGURE 3.20 Initial Grogan Oil Information Technology Call Center Data Dashboard

To illustrate the use of numerical measures in a data dashboard, recall the Grogan Oil
Company application that we used in Section 2.5 to introduce the concept of a data dashboard.
Grogan Oil has offices located in three cities in Texas: Austin (its headquarters), Houston, and
Dallas. Grogan’s Information Technology (IT) call center, located in the Austin office, handles
calls regarding computer-related problems (software, Internet, and e-mail) from employees in
the three offices. Figure 3.20 shows the data dashboard that Grogan developed to monitor the
performance of the call center. The key components of this dashboard are as follows:

DATA file: Grogan


• The stacked bar chart in the upper left corner of the dashboard shows the call volume
for each type of problem (software, Internet, or e-mail) over time.

• The bar chart in the upper right corner of the dashboard shows the percentage of time that
call center employees spent on each type of problem or not working on a call (idle).

• For each unresolved case that extended beyond 15 minutes, the bar chart shown in
the middle left portion of the dashboard shows the length of time that each of these
cases has been unresolved.

• The bar chart in the middle right portion of the dashboard shows the call volume by
office (Houston, Dallas, Austin) for each type of problem.

• The histogram at the bottom of the dashboard shows the distribution of the time to
resolve a case for all resolved cases for the current shift.


In order to gain additional insight into the performance of the call center, Grogan’s IT
manager has decided to expand the current dashboard by adding boxplots for the time
required to resolve calls received for each type of problem (e-mail, Internet, and software). In
addition, a graph showing the time to resolve individual cases has been added in the lower
left portion of the dashboard. Finally, the IT manager added a display of summary statistics
for each type of problem and summary statistics for each of the first few hours of the shift.
The updated dashboard is shown in Figure 3.21.

FIGURE 3.21 Updated Grogan Oil Information Technology Call Center Data Dashboard

Summary statistics shown in the dashboard, by type of case (times in minutes):

Type of Case    Cases    Mean    Median    Std. Dev
E-mail            34      5.8      3.5       5.4
Internet          19      3.0      1.0       4.0
Software          23      5.4      4.0       5.0
The IT call center has set a target performance level or benchmark of 10 minutes for
the mean time to resolve a case. Furthermore, the center has decided it is undesirable 
for the time to resolve a case to exceed 15 minutes. To reflect these benchmarks, a black 
horizontal line at the mean target value of 10 minutes and a red horizontal line at the 
maximum acceptable level of 15 minutes have been added to both the graph showing 
the time to resolve cases and the boxplots of the time required to resolve calls received 
for each type of problem. 

The summary statistics in the dashboard in Figure 3.21 show that the mean time to 
resolve an e-mail case is 5.8 minutes, the mean time to resolve an Internet case is 3.0 min- 
utes, and the mean time to resolve a software case is 5.4 minutes. Thus, the mean time to 
resolve each type of case is better than the target mean (10 minutes). 
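The benchmark comparison above amounts to a simple check that a dashboard could run automatically. The sketch below is our own illustration: the means and the two thresholds come from the text, while the list `sample_times` and the function name `cases_over_limit` are hypothetical and stand in for individual case data that the text does not list.

```python
TARGET_MEAN_MINUTES = 10     # benchmark for the mean time to resolve a case
MAX_ACCEPTABLE_MINUTES = 15  # time to resolve should not exceed this level

# Mean time to resolve, in minutes, as reported in the Figure 3.21 summary statistics.
mean_minutes = {"E-mail": 5.8, "Internet": 3.0, "Software": 5.4}

# Compare each mean with the target, as in the paragraph above.
meets_target = {case: mean < TARGET_MEAN_MINUTES for case, mean in mean_minutes.items()}
print(meets_target)


def cases_over_limit(times, limit=MAX_ACCEPTABLE_MINUTES):
    """Return the individual resolution times that exceed the acceptable maximum."""
    return [t for t in times if t > limit]


# Hypothetical individual resolution times (minutes), for illustration only.
sample_times = [2.0, 4.5, 17.0, 6.0, 21.5, 3.0]
print(cases_over_limit(sample_times))
```

A mean below the target does not rule out individual cases above the 15-minute limit, which is why the dashboard shows both the summary statistics and the case-level graph.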

Reviewing the boxplots, we see that the box associated with the e-mail cases is 
larger than the boxes associated with the other two types of cases. The summary 
statistics also show that the standard deviation of the time to resolve e-mail cases is 
larger than the standard deviations of the times to resolve the other types of cases. 
This leads us to take a closer look at the e-mail cases in the two new graphs. The box- 
plot for the e-mail cases has a whisker that extends beyond 15 minutes and an outlier 
well beyond 15 minutes. The graph of the time to resolve individual cases (in the 
lower left position of the dashboard) shows that this is because of two calls on e-mail 
cases during the 9:00 hour that took longer than the target maximum time (15 minutes) 
to resolve. This analysis may lead the IT call center manager to further investigate 
why resolution times are more variable for e-mail cases than for Internet or software 
cases. Based on this analysis, the IT manager may also decide to investigate the cir- 
cumstances that led to inordinately long resolution times for the two e-mail cases that 
took longer than 15 minutes to resolve. 

The graph of the time to resolve individual cases also shows that most calls received 
during the first hour of the shift were resolved relatively quickly; the graph also shows 
that the time to resolve cases increased gradually throughout the morning. This could 
be due to a tendency for complex problems to arise later in the shift or possibly to the 
backlog of calls that accumulates over time. Although the summary statistics sug- 
gest that cases submitted during the 9:00 hour take the longest to resolve, the graph of 
time to resolve individual cases shows that two time-consuming e-mail cases and one 
time-consuming software case were reported during that hour, and this may explain 
why the mean time to resolve cases during the 9:00 hour is larger than during any other 
hour of the shift. Overall, reported cases have generally been resolved in 15 minutes or
less during this shift.

Dashboards such as the Grogan Oil data dashboard are often interactive. For instance,
when a manager uses a mouse or a touch screen monitor to position the cursor over the
display or point to something on the display, additional information, such as the time to
resolve the problem, the time the call was received, and the individual and/or the location
that reported the problem, may appear. Clicking on the individual item may also take the
user to a new level of analysis at the individual case level.

Drilling down refers to
functionality in interactive
data dashboards that
allows the user to access
information and analyses
at an increasingly detailed
level.


SUMMARY 
In this chapter we introduced several descriptive statistics that can be used to summarize 

the location, variability, and shape of a data distribution. Unlike the tabular and graphical 
displays introduced in Chapter 2, the measures introduced in this chapter summarize the data 
in terms of numerical values. When the numerical values obtained are for a sample, they are 
called sample statistics. When the numerical values obtained are for a population, they are 
called population parameters. Some of the notation used for sample statistics and population 
parameters follows.

As measures of location, we defined the mean, median, mode, weighted mean, geometric
mean, percentiles, and quartiles. Next, we presented the range, interquartile range,
variance, standard deviation, and coefficient of variation as measures of variability or
dispersion. Our primary measure of the shape of a data distribution was the skewness. Negative
values of skewness indicate a data distribution skewed to the left, and positive values of
skewness indicate a data distribution skewed to the right. We then described how the mean
and standard deviation could be used, applying Chebyshev’s theorem and the empirical
rule, to provide more information about the distribution of data and to identify outliers.

In Section 3.4 we showed how to develop a five-number summary and a boxplot 
to provide simultaneous information about the location, variability, and shape of the 
distribution. In Section 3.5 we introduced covariance and the correlation coefficient as 
measures of association between two variables. In the final section, we showed how adding 
numerical measures can improve the effectiveness of data dashboards. 

The descriptive statistics we discussed can be developed using statistical software pack- 
ages and spreadsheets. 


GLOSSARY 
Boxplot A graphical summary of data based on a five-number summary. 

Chebyshev’s theorem A theorem that can be used to make statements about the propor- 
tion of data values that must be within a specified number of standard deviations of the 
mean. 

Coefficient of variation A measure of relative variability computed by dividing the stan- 
dard deviation by the mean and multiplying by 100. 

Correlation coefficient A measure of linear association between two variables that takes
on values between −1 and +1. Values near +1 indicate a strong positive linear
relationship; values near −1 indicate a strong negative linear relationship; and values near zero
indicate the lack of a linear relationship.

Covariance A measure of linear association between two variables. Positive values indi- 
cate a positive relationship; negative values indicate a negative relationship. 

Empirical rule A rule that can be used to compute the percentage of data values that 
must be within one, two, and three standard deviations of the mean for data that exhibit a 
bell-shaped distribution. 

Five-number summary A technique that uses five numbers to summarize the data: smallest 
value, first quartile, median, third quartile, and largest value. 

Geometric mean A measure of location that is calculated by finding the nth root of the 
product of n values. 

Interquartile range (IQR) A measure of variability, defined as the difference between the 
third and first quartiles. 

Mean A measure of central location computed by summing the data values and dividing 
by the number of observations. 

Median A measure of central location provided by the value in the middle when the data
are arranged in ascending order.

Mode A measure of location, defined as the value that occurs with greatest frequency.

Outlier An unusually small or unusually large data value.

Pearson product moment correlation coefficient A measure of the linear relationship 
between two variables. 

Percentile A value that provides information about how the data are spread over the inter- 
val from the smallest to the largest value. 

Point estimator A sample statistic, such as $\bar{x}$, $s^2$, and $s$, used to estimate the corresponding
population parameter.

Population parameter A numerical value used as a summary measure for a population (e.g.,
the population mean, $\mu$, the population variance, $\sigma^2$, and the population standard deviation, $\sigma$).

pth percentile For a data set containing n observations, the pth percentile divides
the data into two parts: Approximately p% of the observations are less than the pth
percentile and approximately (100 − p)% of the observations are greater than the pth
percentile.

Quartiles The 25th, 50th, and 75th percentiles, referred to as the first quartile, the second 
quartile (median), and third quartile, respectively. The quartiles can be used to divide a data 
set into four parts, with each part containing approximately 25% of the data. 

Range A measure of variability, defined to be the largest value minus the smallest value.

Sample statistic A numerical value used as a summary measure for a sample (e.g., the
sample mean, $\bar{x}$, the sample variance, $s^2$, and the sample standard deviation, $s$).

Skewness A measure of the shape of a data distribution. Data skewed to the left result in 
negative skewness; a symmetric data distribution results in zero skewness; and data skewed 
to the right result in positive skewness. 

Standard deviation A measure of variability computed by taking the positive square root 
of the variance. 

Variance A measure of variability based on the squared deviations of the data values 
about the mean. 

Weighted mean The mean obtained by assigning each observation a weight that reflects 
its importance. 

z-score A value computed by dividing the deviation about the mean $(x_i - \bar{x})$ by the
standard deviation $s$. A z-score is referred to as a standardized value and denotes the number of
standard deviations $x_i$ is from the mean.


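KEY FORMULAS


Sample Mean

$$\bar{x} = \frac{\sum x_i}{n} \tag{3.1}$$

Population Mean

$$\mu = \frac{\sum x_i}{N} \tag{3.2}$$

Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} \tag{3.3}$$

Geometric Mean

$$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \tag{3.4}$$

Location of the pth Percentile

$$L_p = \frac{p}{100}(n + 1) \tag{3.5}$$

Interquartile Range

$$\text{IQR} = Q_3 - Q_1 \tag{3.6}$$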
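
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the Excel functions used in the text: the function names and the sample values are hypothetical, and the `percentile` helper assumes that a fractional location $L_p$ is handled by linear interpolation between the two neighboring sorted values.

```python
import math


def sample_mean(xs):
    """Equation (3.1): sum of the observations divided by n."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Equation (3.3): sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def geometric_mean(xs):
    """Equation (3.4): nth root of the product of n values."""
    return math.prod(xs) ** (1 / len(xs))


def percentile_location(p, n):
    """Equation (3.5): L_p = (p / 100)(n + 1), a 1-based position."""
    return p / 100 * (n + 1)


def percentile(xs, p):
    """pth percentile, assuming linear interpolation at position L_p."""
    data = sorted(xs)
    loc = percentile_location(p, len(data))
    lower = math.floor(loc)
    if lower < 1:               # location falls before the first value
        return data[0]
    if lower >= len(data):      # location falls at or past the last value
        return data[-1]
    frac = loc - lower
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


def iqr(xs):
    """Equation (3.6): third quartile minus first quartile."""
    return percentile(xs, 75) - percentile(xs, 25)


# Hypothetical data for illustration only.
values = [10, 20, 30, 40, 50, 60, 70]
print(sample_mean(values))
print(weighted_mean([2, 4], [1, 3]))
print(geometric_mean([2, 8]))
print(percentile_location(25, len(values)))
print(iqr(values))
```

The population mean in equation (3.2) uses the same arithmetic as `sample_mean`, applied to all N elements of the population.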
SUPPLEMENTARY EXERCISES


62. Americans Dining Out. Americans tend to dine out multiple times per week. The num- 
ber of times a sample of 20 families dined out last week provides the following data. 


6 | 5 3 7 3 0 3 1 
4 1 2 4 1 0 5 6 3 1 
a. Compute the mean and median.
b. Compute the first and third quartiles.
c. Compute the range and interquartile range.
d. Compute the variance and standard deviation.
e. The skewness measure for these data is 0.34. Comment on the shape of this
distribution. Is it the shape you would expect? Why or why not?
f. Do the data contain outliers?
63. NCAA Football Coaches Salaries. A 2017 USA Today article reports that NCAA
football coaches’ salaries have continued to increase in recent years (USA Today). The
annual base salaries for the previous head football coach and the new head football
coach at 23 schools are given in the file Coaches.
a. Determine the median annual salary for a previous head football coach and a new
head football coach.

b. Compute the range for salaries for both previous and new head football coaches.

c. Compute the standard deviation for salaries for both previous and new head football
coaches.

d. Based on your answers to (a) to (c), comment on any differences between the
annual base salary a school pays a new head football coach compared to what it paid
its previous head football coach.

64. Physician Office Waiting Times. The average waiting time for a patient at an El Paso
physician’s office is just over 29 minutes, well above the national average of 21 minutes.
In order to address the issue of long patient wait times, some physicians’ offices are
using wait tracking systems to notify patients of expected wait times. Patients can adjust
their arrival times based on this information and spend less time in waiting rooms. The
following data show wait times (minutes) for a sample of patients at offices that do not
have a wait tracking system and wait times for a sample of patients at offices with a
wait tracking system.


Without Wait Tracking System    With Wait Tracking System
24 31 
67 14 
17 14 
20 18 
31 12 
44 37 
12 9 
25 1s 
16 12 
37 15 


a. What are the mean and median patient wait times for offices with a wait tracking 
system? What are the mean and median patient wait times for offices without a wait 
tracking system? 

b. What are the variance and standard deviation of patient wait times for offices with a 
wait tracking system? What are the variance and standard deviation of patient wait 
times for visits to offices without a wait tracking system? 

c. Do offices with a wait tracking system have shorter patient wait times than offices 
without a wait tracking system? Explain. 

d. Considering only offices without a wait tracking system, what is the z-score for the 
tenth patient in the sample? 

e. Considering only offices with a wait tracking system, what is the z-score for the sixth 
patient in the sample? How does this z-score compare with the z-score you calculated for 
part (d)? 

f. Based on z-scores, do the data for offices without a wait tracking system contain 
any outliers? Based on z-scores, do the data for offices with a wait tracking system 
contain any outliers? 

65. Worker Productivity and Insomnia. U.S. companies lose $63.2 billion per year
from workers with insomnia. According to a 2013 article in The Wall Street Journal,
workers lose an average of 7.8 days of productivity per year due to lack of sleep. The
following data show the number of hours of sleep attained during a recent night for a
sample of 20 workers.

7 8 6 9 8 9 6 10


a. What is the mean number of hours of sleep for this sample? 
b. What is the variance? Standard deviation? 

66. Smartphone Use. Smartphones have become ubiquitous for most people and have
become the predominant means of communication among people. Consider the following
data indicating the number of minutes in a month spent interacting with others via
a smartphone for a sample of 50 smartphone users.


353 458 404 394 416
437 430 369 448 430
431 469 446 387 445
354 468 422 402 360 
444 424 441 357 435 
461 407 470 413 351 
464 374 417 460 352 
445 387 468 368 430 
384 367 436 390 464 
405 372 401 388 367 
a. What is the mean number of minutes spent interacting with others for this sample? 
How does it compare to the mean reported in the study? 
b. What is the standard deviation for this sample? 
c. Are there any outliers in this sample? 

67. Work Commuting Methods. Public transportation and the automobile are two meth- 
ods an employee can use to get to work each day. Samples of travel times recorded for 
each method are shown. Times are in minutes. 

Public Transportation: 28 29 32 37 33 25 29 32 41 34
Automobile:            29 31 33 32 34 30 31 32 35 33


a. Compute the sample mean time to get to work for each method. 

b. Compute the sample standard deviation for each method. 

c. On the basis of your results from parts (a) and (b), which method of transportation 
should be preferred? Explain. 

d. Develop a boxplot for each method. Does a comparison of the boxplots support 
your conclusion in part (c)? 

68. The following data represent a sample of 14 household incomes ($1000s). Answer the 
following questions based on this sample. 


49.4 52.4 53.4 51.3 52.1 48.7 52.1 
52.2 64.5 51.6 46.5 52.9 52.5 51.2


a. What is the median household income for these sample data? 

b. According to a previous survey, the median annual household income five years 
ago was $55,000. Based on the sample data above, estimate the percentage change 
in the median household income from five years ago to today. 

c. Compute the first and third quartiles of the sample data. 

d. Provide a five-number summary for the sample data. 

e. Using the z-score approach, do the data contain any outliers? Does the approach 
that uses the values of the first and third quartiles and the interquartile range to 
detect outliers provide the same results? 

69. Restaurant Chains’ Sales per Store. The data contained in the file FoodIndustry 
show the company/chain name, the average sales per store ($1000s), and the food seg- 
ment industry for 47 restaurant chains (Quick Service Restaurant Magazine website). 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


Chapter 3 Descriptive Statistics: Numerical Measures 


a. What was the mean U.S. sales per store for the 47 restaurant chains?

b. What are the first and third quartiles? What is your interpretation of the quartiles?

c. Show a boxplot for the level of sales and discuss if there are any outliers in terms of 
sales that would skew the results. 

d. Develop a frequency distribution showing the average sales per store for each seg- 
ment. Comment on the results obtained. 

70. Best Hotels. Travel + Leisure magazine presented its annual list of the 500 best 
hotels in the world. The magazine provides a rating for each hotel along with a brief 
description that includes the size of the hotel, amenities, and the cost per night for a 
double room. A sample of 12 of the top-rated hotels in the United States follows. 


Hotel                               Location               Rooms   Cost/Night ($)
Boulders Resort & Spa               Phoenix, AZ              220        499
Disney's Wilderness Lodge           Orlando, FL              727        340
Four Seasons Hotel Beverly Hills    Los Angeles, CA          285        585
Four Seasons Hotel                  Boston, MA               273        495
Hay-Adams                           Washington, DC           145        495
Inn on Biltmore Estate              Asheville, NC            213        279
Loews Ventana Canyon Resort         Phoenix, AZ              398        279
Mauna Lani Bay Hotel                Island of Hawaii         343        455
Montage Laguna Beach                Laguna Beach, CA         250        595
Sofitel Water Tower                 Chicago, IL              414        367
St. Regis Monarch Beach             Dana Point, CA           400        675
The Broadmoor                       Colorado Springs, CO     700        420


a. What is the mean number of rooms? 

b. What is the mean cost per night for a double room? 

c. Develop a scatter diagram with the number of rooms on the horizontal axis and the 
cost per night on the vertical axis. Does there appear to be a relationship between 
the number of rooms and the cost per night? Discuss. 

d. What is the sample correlation coefficient? What does it tell you about the relation- 
ship between the number of rooms and the cost per night for a double room? Does 
this appear reasonable? Discuss. 

71. NFL Teams Worth. In 2014, the 32 teams in the National Football League (NFL) 
were worth, on average, $1.17 billion, 5% more than in 2013. The following data 
show the annual revenue ($ millions) and the estimated team value ($ mil- 
lions) for the 32 NFL teams in 2014 (Forbes website). 


Team                     Revenue ($ millions)   Current Value ($ millions)
Arizona Cardinals                253                      961
Atlanta Falcons                  252                      933
Baltimore Ravens                 292                     1227
Buffalo Bills                    256                      870
Carolina Panthers                271                     1057
Chicago Bears                    298                     1252
Cincinnati Bengals               250                      924
Cleveland Browns                 264                     1005
Dallas Cowboys                   539                     2300
Denver Broncos                   283                     1161
Detroit Lions                    248                      900
Green Bay Packers                282                     1183
Houston Texans                   320                     1450
Indianapolis Colts               276                     1200
Jacksonville Jaguars             260                      840
Kansas City Chiefs               245                     1009
Miami Dolphins                   268                     1074
Minnesota Vikings                234                     1007
New England Patriots             408                     1800
New Orleans Saints               276                     1004
New York Giants                  338                     1550
New York Jets                    321                     1380
Oakland Raiders                  229                      825
Philadelphia Eagles              306                     1314
Pittsburgh Steelers              266                     1118
San Diego Chargers               250                      949
San Francisco 49ers              255                     1224
Seattle Seahawks                 270                     1081
St. Louis Rams                   239                      875
Tampa Bay Buccaneers             267                     1067
Tennessee Titans                 270                     1055
Washington Redskins              381                     1700


a. Develop a scatter diagram with Revenue on the horizontal axis and Value on the
vertical axis. Does there appear to be any relationship between the two
variables?

b. What is the sample correlation coefficient? What can you say about the strength of 
the relationship between Revenue and Value? 

72. Major League Baseball Team Winning Percentages. Does a major league baseball 
team’s record during spring training indicate how the team will play during the regular 
season? Over a six-year period, the correlation coefficient between a team’s winning 
percentage in spring training and its winning percentage in the regular season is .18. 
Shown are the winning percentages for the 14 American League teams during a 
previous season. 


Team                   Spring Training   Regular Season
Baltimore Orioles           .407              .422
Boston Red Sox              .429              .586
Chicago White Sox           .417              .546
Cleveland Indians           .569              .500
Detroit Tigers              .569              .457
Kansas City Royals          .588              .463
Los Angeles Angels          .724              FON,
Minnesota Twins             .500              .540
New York Yankees            iT)               .549
Oakland A's                 .692              .466
Seattle Mariners            .500              B
Tampa Bay Rays              AS)               .599
Texas Rangers               .643              .488
Toronto Blue Jays           .448              S]


a. What is the correlation coefficient between the spring training and the regular 
season winning percentages? 

b. What is your conclusion about a team’s record during spring training indicating 
how the team will play during the regular season? What are some of the reasons 
why this occurs? Discuss. 

73. Money Market Funds Days to Maturity. The days to maturity for a sample of five
money market funds are shown here. The dollar amounts invested in the funds are
provided. Use the weighted mean to determine the mean number of days to maturity
for dollars invested in these five money market funds.

Days to Maturity    Dollar Value ($ millions)
20 20 
12 30 
7 10 
5 15 
6 10 


74. Automobile Speeds. Automobiles traveling on a road with a posted speed limit of
55 miles per hour are checked for speed by a state police radar system. Following is a
frequency distribution of speeds.


Speed (miles per hour)    Frequency
45-49                         10
50-54                         40
55-59                        150
60-64                        175
65-69                         75
70-74                         15
75-79                         10
Total                        475


a. What is the mean speed of the automobiles traveling on this road?
b. Compute the variance and the standard deviation.


75. Annual Returns for Panama Railroad Company Stock. The Panama Railroad 
Company was established in 1850 to construct a railroad across the isthmus that would 
allow fast and easy access between the Atlantic and Pacific Oceans. The following 
table provides annual returns for Panama Railroad stock from 1853 through 1880. 


Year    Return on Panama Railroad Company Stock (%)
1853    ="
1854    a9
1855    19
1856    2
1857    3
1858    36
1859    21
1860    16
1861    -5
1862    43
1863    44
1864    48
1865    5
1866    11
1867    23
1868    20
1869    =]
1870    =|
1871    -42
1872    39
1873    42
1874    12
1875    26
1876    9
1877    =o
1878    25
1879    31
1880    30


a. Create a graph of the annual returns on the stock. The New York Stock Exchange 
earned an annual average return of 8.4% from 1853 through 1880. Can you tell 
from the graph if the Panama Railroad Company stock outperformed the New York 
Stock Exchange? 

b. Calculate the mean annual return on Panama Railroad Company stock from 1853 
through 1880. Did the stock outperform the New York Stock Exchange over the 
same period? 


CASE PROBLEM 1: PELICAN STORES 
Pelican Stores, a division of National Clothing, is a chain of women’s apparel stores 
operating throughout the country. The chain recently ran a promotion in which discount 
coupons were sent to customers of other National Clothing stores. Data collected for a sam- 
ple of 100 in-store credit card transactions at Pelican Stores during one day while the pro- 
motion was running are contained in the file PelicanStores. Table 3.9 shows a portion of the 
data set. The proprietary card method of payment refers to charges made using a National 
Clothing charge card. Customers who made a purchase using a discount coupon are referred 
to as promotional customers and customers who made a purchase but did not use a discount 
coupon are referred to as regular customers. Because the promotional coupons were not sent 
to regular Pelican Stores customers, management considers the sales made to people pre- 
senting the promotional coupons as sales it would not otherwise make. Of course, Pelican 
also hopes that the promotional customers will continue to shop at its stores. 

Most of the variables shown in Table 3.9 are self-explanatory, but two of the variables 
require some clarification. 


Items The total number of items purchased 
Net Sales The total amount ($) charged to the credit card 


Pelican’s management would like to use this sample data to learn about its customer base 
and to evaluate the promotion involving discount coupons. 


Managerial Report 
Use the methods of descriptive statistics presented in this chapter to summarize the data and 
comment on your findings. At a minimum, your report should include the following: 


1. Descriptive statistics on net sales and descriptive statistics on net sales by various 
classifications of customers. 
2. Descriptive statistics concerning the relationship between age and net sales. 




TABLE 3.9 Sample of 100 Credit Card Purchases at Pelican Stores


Customer   Type of Customer   Items   Net Sales   Method of Payment   Gender   Marital Status   Age
1          Regular              1       39.50     Discover            Male     Married          32
2          Promotional          1      102.40     Proprietary Card    Female   Married          36
3          Regular              1       22.50     Proprietary Card    Female   Married          32
4          Promotional          5      100.40     Proprietary Card    Female   Married          28
5          Regular              2       54.00     MasterCard          Female   Married          34
6          Regular              1       44.50     MasterCard          Female   Married          44
7          Promotional          2       78.00     Proprietary Card    Female   Married          30
8          Regular              1       22.50     Visa                Female   Married          40
9          Promotional          2       56.52     Proprietary Card    Female   Married          46
10         Regular              1       44.50     Proprietary Card    Female   Married          36
...
96         Regular              1       39.50     MasterCard          Female   Married          44
97         Promotional          9      253.00     Proprietary Card    Female   Married          30
98         Promotional         10      287.59     Proprietary Card    Female   Married          52
99         Promotional          2       47.60     Proprietary Card    Female   Married          30
100        Promotional          1       28.44     Proprietary Card    Female   Married          44


CASE PROBLEM 2: MOVIE THEATER RELEASES


The movie industry is a competitive business. More than 50 studios produce hundreds of
new movies for theater release each year, and the financial success of each movie varies
considerably. The opening weekend gross sales ($ millions), the total gross sales
($ millions), the number of theaters the movie was shown in, and the number of weeks
the movie was in release are common variables used to measure the success of a movie.
Data on the top 100 grossing movies released in 2016 (Box Office Mojo website) are
contained in the file Movies2016. Table 3.10 shows the data for the first 10 movies in this file.


TABLE 3.10 Performance Data for Ten 2016 Movies Released to Theaters


Movie Title                           Opening Gross Sales   Total Gross Sales   Number of   Weeks in
                                      ($ millions)          ($ millions)        Theaters    Release
Rogue One: A Star Wars Story               155.08               532.18            4,157        20
Finding Dory                               135.06               486.30            4,305        25
Captain America: Civil War                 179.14               408.08            4,226        20
The Secret Life of Pets                    104.35               368.38            4,381        25
The Jungle Book                            103.26               364.00            4,144        24
Deadpool                                   132.43               363.07            3,856        18
Zootopia                                    75.06               341.27            3,959        22
Batman v Superman: Dawn of Justice         166.01               330.36            4,256        12
Suicide Squad                              133.68               325.10            4,255        14
Sing                                        35.26               270.40            4,029        20


Managerial Report
Use the numerical methods of descriptive statistics presented in this chapter to learn how these 
variables contribute to the success of a movie. Include the following in your report: 


1. Descriptive statistics for each of the four variables along with a discussion of what the 
descriptive statistics tell us about the movie industry. 

2. What movies, if any, should be considered high-performance outliers? Explain. 

3. Descriptive statistics showing the relationship between total gross sales and each of the 
other variables. Discuss. 


CASE PROBLEM 3: BUSINESS SCHOOLS OF ASIA-PACIFIC


The pursuit of a higher education degree in business is now international. A survey shows
that more and more Asians choose the master of business administration (MBA) degree
route to corporate success. As a result, the number of applicants for MBA courses at
Asia-Pacific schools continues to increase.

Across the region, thousands of Asians show an increasing willingness to temporarily 
shelve their careers and spend two years in pursuit of a theoretical business qualification. 
Courses in these schools are notoriously tough and include economics, banking, marketing, 
behavioral sciences, labor relations, decision making, strategic thinking, business law, and 
more. The data set in Table 3.11 shows some of the characteristics of the leading Asia-Pacific 
business schools. 




Managerial Report 

Use the methods of descriptive statistics to summarize the data in Table 3.11. Discuss 
your findings. Your discussion should include a summary of each variable in the data set. 
What new insights do these descriptive statistics provide concerning Asia-Pacific business 
schools? You should also analyze differences between local and foreign tuition costs, be- 
tween mean starting salaries for schools requiring and not requiring work experience, and 
between starting salaries for schools requiring and not requiring English tests. 


CASE PROBLEM 4: HEAVENLY CHOCOLATES


Heavenly Chocolates manufactures and sells quality chocolate products at its plant and
retail store located in Saratoga Springs, New York. Two years ago the company developed
a website and began selling its products over the Internet. Website sales have exceeded
the company’s expectations, and management is now considering strategies to increase
sales even further. To learn more about the website customers, a sample of 50 Heavenly
Chocolates transactions was selected from the previous month’s sales. Data showing the day
of the week each transaction was made, the type of browser the customer used, the time
spent on the website, the number of website pages viewed, and the amount spent by each
of the 50 customers are contained in the file HeavenlyChocolates. A portion of the data is
shown in Table 3.12.

Heavenly Chocolates would like to use the sample data to determine if online shoppers 
who spend more time and view more pages also spend more money during their visit to the 
website. The company would also like to investigate the effect that the day of the week and 
the type of browser have on sales. 


TABLE 3.12 A Sample of 50 Heavenly Chocolates Website Transactions


Customer   Day   Browser   Time (min)   Pages Viewed   Amount Spent ($)
1          Mon   Chrome      12.0            4              54.52
2          Wed   Other       19.5            6              94.90
3          Mon   Chrome       8.5            4              26.68
4          Tue   Firefox      A              2              44.73
5          Wed   Chrome      11.3            4              66.27
6          Sat   Firefox     10.5            6              67.80
7          Sun   Chrome      11.4            2              36.04
...
48         Ra    Chrome       9.7            5             103.15
49         Mon   Other        A3             6              52415
50         Fri   Chrome       BA             3              AS


Managerial Report
Use the methods of descriptive statistics to learn about the customers who visit the
Heavenly Chocolates website. Include the following in your report:


1. Graphical and numerical summaries for the length of time the shopper spends on the
website, the number of pages viewed, and the mean amount spent per transaction.
Discuss what you learn about Heavenly Chocolates’ online shoppers from these
numerical summaries.

2. Summarize the frequency, the total dollars spent, and the mean amount spent per 
transaction for each day of week. What observations can you make about Heavenly 
Chocolates’ business based on the day of the week? Discuss. 

3. Summarize the frequency, the total dollars spent, and the mean amount spent per 
transaction for each type of browser. What observations can you make about Heavenly 
Chocolates’ business based on the type of browser? Discuss. 

4. Develop a scatter diagram and compute the sample correlation coefficient to explore 
the relationship between the time spent on the website and the dollar amount spent. 
Use the horizontal axis for the time spent on the website. Discuss. 

5. Develop a scatter diagram and compute the sample correlation coefficient to explore the 
relationship between the number of website pages viewed and the amount spent. Use 
the horizontal axis for the number of website pages viewed. Discuss. 

6. Develop a scatter diagram and compute the sample correlation coefficient to explore 
the relationship between the time spent on the website and the number of pages viewed. 
Use the horizontal axis to represent the number of pages viewed. Discuss. 


CASE PROBLEM 5: AFRICAN ELEPHANT POPULATIONS


Although millions of elephants once roamed across Africa, by the mid-1980s elephant 
populations in African nations had been devastated by poaching. Elephants are important 
to African ecosystems. In tropical forests, elephants create clearings in the canopy that 
encourage new tree growth. In savannas, elephants reduce bush cover to create an envi- 
ronment that is favorable to browsing and grazing animals. In addition, the seeds of many 
plant species depend on passing through an elephant’s digestive tract before germination. 

The status of the elephant now varies greatly across the continent. In some nations, strong
measures have been taken to effectively protect elephant populations; for example, Kenya
has destroyed over five tons of elephant ivory confiscated from poachers in an attempt to
deter the growth of illegal ivory trade (Associated Press, July 20, 2011). In other nations the
elephant populations remain in danger due to poaching for meat and ivory, loss of habitat,
and conflict with humans. Table 3.13 shows elephant populations for several African nations
in 1979, 1989, 2007, and 2012 (ElephantDatabase.org website).


TABLE 3.13 Elephant Populations for Several African Nations in 1979, 1989, 2007, and 2012


                                       Elephant Population
Country                       1979       1989       2007       2012
Angola                      12,400     12,400      2,590      2,530
Botswana                    20,000     51,000    175,487    175,454
Cameroon                    16,200     21,200     15,387     14,049
Central African Republic    63,000     19,000      3,334      2,285
Chad                        15,000      3,100      6,435      3,004
Congo                       10,800     70,000     22,102     49,248
Dem Rep of Congo           377,700     85,000     23,714     13,674
Gabon                       13,400     76,000     70,637     TH 252)
Kenya                       65,000     19,000     31,636     36,260
Mozambique                  54,800     18,600     26,088     26,513
Somalia                     24,300      6,000         70         70
Tanzania                   316,300     80,000    167,003    117,456
Zambia                     150,000     41,000     2929       2117539.
Zimbabwe                    30,000     43,000     99,107    100,291

The David Sheldrick Wildlife Trust was established in 1977 to honor the memory of
naturalist David Leslie William Sheldrick, who was the founding warden of Tsavo East National
Park in Kenya and headed the Planning Unit of the Wildlife Conservation and Management
Department in that country. Management of the Sheldrick Trust would like to know what
these data indicate about elephant populations in various African countries since 1979.


Managerial Report 

Use methods of descriptive statistics to summarize the data and comment on changes in 
elephant populations in African nations since 1979. At a minimum your report should 
include the following. 


1. The mean annual change in elephant population for each country in the 10 years from 
1979 to 1989, and a discussion of which countries saw the largest changes in elephant 
population over this 10-year period. 

2. The mean annual change in elephant population for each country from 1989 to 2007, 
and a discussion of which countries saw the largest changes in elephant population 
over this 18-year period. 

3. The mean annual change in elephant population for each country from 2007 to 2012, 
and a discussion of which countries saw the largest changes in elephant population 
over this 5-year period. 

4. A comparison of your results from parts 1, 2, and 3, and a discussion of the
conclusions you can draw from this comparison.


Chapter 4 Introduction to Probability


STATISTICS IN PRACTICE 


National Aeronautics and Space Administration* 
WASHINGTON, DC 


The National Aeronautics and Space Administration 
(NASA) is the agency of the United States government 
that is responsible for the U.S. civilian space program 
and aeronautics and aerospace research. NASA is best 
known for its manned space exploration; its mission 
statement is to “drive advances in science, technol- 
ogy, aeronautics, and space exploration to enhance 
knowledge, education, innovation, economic vitality 
and stewardship of Earth.” NASA, with more than 
17,000 employees, oversees many different space- 
based missions including work on the International 
Space Station, exploration beyond our solar system 
with the Hubble telescope, and planning for possible 
future astronaut missions to the moon and Mars. 

Although NASA's primary mission is space exploration, its expertise has been called
upon to assist countries and organizations throughout the world. In one such situation,
the San José copper and gold mine in Copiapó, Chile, caved in, trapping 33 men
more than 2000 feet underground. While it was important to bring the men safely to
the surface as quickly as possible, it was imperative that the rescue effort be
carefully designed and implemented to save as many miners as possible. The Chilean
government asked NASA to provide assistance in developing a rescue method. In
response, NASA sent a four-person team consisting of an engineer, two physicians,
and a psychologist with expertise in vehicle design and issues of long-term
confinement.

The probability of success and failure of various rescue methods was prominent in
the thoughts of everyone involved. Since there were no historical data available that
applied to this unique rescue situation, NASA scientists developed subjective
probability estimates for the success and failure of various rescue methods based on
similar circumstances experienced by astronauts returning from short- and long-term
space missions. The probability estimates provided by NASA guided officials in the
selection of a rescue method and provided insight as to how the miners would survive
the ascent in a rescue cage.

*The authors are indebted to Dr. Michael Duncan and Clinton Cragg at
NASA for providing this Statistics in Practice.

Photo caption: NASA scientists based probabilities on similar circumstances
experienced during space flights. (REUTERS/Hugo Infante/Government of Chile/Handout)

The rescue method designed by the Chilean officials 
in consultation with the NASA team resulted in the 
construction of a 13-foot-long, 924-pound steel rescue 
capsule that would be used to bring up the miners one 
at a time. All miners were rescued, with the last miner 
emerging 68 days after the cave-in occurred. 

In this chapter you will learn about probability as 
well as how to compute and interpret probabilities for a 
variety of situations. In addition to subjective probabili- 
ties, you will learn about classical and relative frequency 
methods for assigning probabilities. The basic relation- 
ships of probability, conditional probability, and Bayes’ 
theorem will be covered. 


Managers often base their decisions on an analysis of uncertainties such as the following: 


1. What are the chances that sales will decrease if we increase prices? 
2. What is the likelihood a new assembly method will increase productivity? 
3. How likely is it that the project will be finished on time? 
4. What is the chance that a new investment will be profitable? 


Probability is a numerical measure of the likelihood that an event will occur. Thus, 
probabilities can be used as measures of the degree of uncertainty associated with the 
four events previously listed. If probabilities are available, we can determine the likelihood 
of each event occurring. 

Some of the earliest work on probability originated in a series of letters between 
Pierre de Fermat and Blaise Pascal in the 1650s. 

FIGURE 4.1 Probability as a Numerical Measure of the Likelihood of an Event Occurring 

[Figure: a probability scale running from 0 to 1.0, with the likelihood of occurrence 
increasing from left to right. At the midpoint, .5, the occurrence of the event is just as 
likely as it is unlikely.] 


Probability values are always assigned on a scale from 0 to 1. A probability near zero 
indicates an event is unlikely to occur; a probability near 1 indicates an event is almost 
certain to occur. Other probabilities between 0 and 1 represent degrees of likelihood that 
an event will occur. For example, if we consider the event “rain tomorrow,” we under- 
stand that when the weather report indicates “a near-zero probability of rain,” it means 
almost no chance of rain. However, if a .90 probability of rain is reported, we know that 
rain is likely to occur. A .50 probability indicates that rain is just as likely to occur as not. 
Figure 4.1 depicts the view of probability as a numerical measure of the likelihood of an 
event occurring. 


4.1 Experiments, Counting Rules, 
and Assigning Probabilities 


In discussing probability, we define an experiment as a process that generates well-defined 
outcomes. On any single repetition of an experiment, one and only one of the possible 
experimental outcomes will occur. Several examples of experiments and their associated 
outcomes follow. 


Experiment                        Experimental Outcomes 
Toss a coin                       Head, tail 
Select a part for inspection      Defective, nondefective 
Conduct a sales call              Purchase, no purchase 
Roll a die                        1, 2, 3, 4, 5, 6 
Play a football game              Win, lose, tie 


By specifying all possible experimental outcomes, we identify the sample space for an 
experiment. 


SAMPLE SPACE 


The sample space for an experiment is the set of all experimental outcomes. 


An experimental outcome is also called a sample point to identify it as an element of the 
sample space. 

Consider the first experiment in the preceding table—tossing a coin. The upward face 
of the coin—a head or a tail—determines the experimental outcomes (sample points). 
If we let S denote the sample space, we can use the following notation to describe the 
sample space. 


S = {Head, Tail} 


The sample space for the second experiment in the table—selecting a part for inspection— 
can be described as follows: 


S = {Defective, Nondefective} 


Both of these experiments have two experimental outcomes (sample points). However, 
suppose we consider the fourth experiment listed in the table—rolling a die. The possible 
experimental outcomes, defined as the number of dots appearing on the upward face of the 
die, are the six points in the sample space for this experiment. 


S = {1, 2, 3, 4, 5, 6} 


Counting Rules, Combinations, and Permutations 


Being able to identify and count the experimental outcomes is a necessary step in assigning 
probabilities. We now discuss three useful counting rules. 


Multiple-Step Experiments The first counting rule applies to multiple-step experiments. 
Consider the experiment of tossing two coins. Let the experimental outcomes be defined in 
terms of the pattern of heads and tails appearing on the upward faces of the two coins. How 
many experimental outcomes are possible for this experiment? The experiment of tossing 
two coins can be thought of as a two-step experiment in which step 1 is the tossing of the 
first coin and step 2 is the tossing of the second coin. If we use H to denote a head and T to 
denote a tail, (H, H) indicates the experimental outcome with a head on the first coin and 
a head on the second coin. Continuing this notation, we can describe the sample space (S) 
for this coin-tossing experiment as follows: 


S = {(H, H), (H, T), (T, H), (T, T)} 


Thus, we see that four experimental outcomes are possible. In this case, we can easily list 
all the experimental outcomes. 

The counting rule for multiple-step experiments makes it possible to determine the num- 
ber of experimental outcomes without listing them. 


COUNTING RULE FOR MULTIPLE-STEP EXPERIMENTS 


If an experiment can be described as a sequence of k steps with $n_1$ possible outcomes 
on the first step, $n_2$ possible outcomes on the second step, and so on, then the total 
number of experimental outcomes is given by $(n_1)(n_2) \cdots (n_k)$. 


Viewing the experiment of tossing two coins as a sequence of first tossing one coin 
($n_1 = 2$) and then tossing the other coin ($n_2 = 2$), we can see from the counting rule that 
(2)(2) = 4 distinct experimental outcomes are possible. As shown, they are S = {(H, H), 
(H, T), (T, H), (T, T)}. The number of experimental outcomes in an experiment involving 
tossing six coins is (2)(2)(2)(2)(2)(2) = 64. 

A tree diagram is a graphical representation that helps in visualizing a multiple-step 
experiment. Figure 4.2 shows a tree diagram for the experiment of tossing two coins. The 
sequence of steps moves from left to right through the tree. Step 1 corresponds to tossing 
the first coin, and step 2 corresponds to tossing the second coin. For each step, the 
two possible outcomes are head or tail. Note that for each possible outcome at step 1 two 
branches correspond to the two possible outcomes at step 2. Each of the points on the right 
end of the tree corresponds to an experimental outcome. Each path through the tree from 
the leftmost node to one of the nodes at the right side of the tree corresponds to a unique 
sequence of outcomes. 

Without the tree diagram, one might think only three experimental outcomes are 
possible for two tosses of a coin: 0 heads, 1 head, and 2 heads. 


FIGURE 4.2 Tree Diagram for the Experiment of Tossing Two Coins 

[Figure: a tree in which step 1 (first coin) branches into head and tail, and each of those 
branches splits again at step 2 (second coin) into head and tail, ending in the four 
experimental outcomes (sample points) (H, H), (H, T), (T, H), and (T, T).] 
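
The counting rule and the tree diagram can also be checked by brute-force enumeration. The following is a minimal sketch in Python (the text itself works in Excel, so the language and the helper name `count_outcomes` are assumptions made for illustration). It multiplies the number of outcomes at each step and, for the two-coin case, lists the sample space in the same order as the tree.

```python
from itertools import product

def count_outcomes(step_sizes):
    """Counting rule for multiple-step experiments: multiply n_1, n_2, ..., n_k."""
    total = 1
    for n in step_sizes:
        total *= n
    return total

# Enumerate the sample space for tossing two coins (H = head, T = tail).
two_coins = list(product("HT", repeat=2))

print(two_coins)                # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
print(count_outcomes([2, 2]))   # 4
print(count_outcomes([2] * 6))  # 64
```

Enumeration is practical only for small experiments; the counting rule gives the total without listing anything.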

Let us now see how the counting rule for multiple-step experiments can be used in 
the analysis of a capacity expansion project for the Kentucky Power & Light Company 
(KP&L). KP&L is starting a project designed to increase the generating capacity of one of 
its plants in northern Kentucky. The project is divided into two sequential stages or steps: 
stage 1 (design) and stage 2 (construction). Even though each stage will be scheduled and 
controlled as closely as possible, management cannot predict beforehand the exact time 
required to complete each stage of the project. An analysis of similar construction projects 
revealed possible completion times for the design stage of 2, 3, or 4 months and possible 
completion times for the construction stage of 6, 7, or 8 months. In addition, because of the 
critical need for additional electrical power, management set a goal of 10 months for the 
completion of the entire project. 

Because this project has three possible completion times for the design stage (step 1) 
and three possible completion times for the construction stage (step 2), the counting 
rule for multiple-step experiments can be applied here to determine a total of (3)(3) = 9 
experimental outcomes. To describe the experimental outcomes, we use a two-number 
notation; for instance, (2, 6) indicates that the design stage is completed in 2 months and 
the construction stage is completed in 6 months. This experimental outcome results in a 
total of 2 + 6 = 8 months to complete the entire project. Table 4.1 summarizes the nine 
experimental outcomes for the KP&L problem. The tree diagram in Figure 4.3 shows how 
the nine outcomes (sample points) occur. 

The counting rule and tree diagram help the project manager identify the experimen- 
tal outcomes and determine the possible project completion times. From the information 
in Figure 4.3, we see that the project will be completed in 8 to 12 months, with six of 
the nine experimental outcomes providing the desired completion time of 10 months or 
less. Even though identifying the experimental outcomes may be helpful, we need to 
consider how probability values can be assigned to the experimental outcomes before 
making an assessment of the probability that the project will be completed within the 
desired 10 months. 
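
The KP&L outcomes can be generated the same way. This is a small sketch, assuming Python and variable names chosen only for illustration. It pairs each design time with each construction time, totals the months, and counts the outcomes that meet the 10-month goal.

```python
from itertools import product

design_months = [2, 3, 4]        # possible stage 1 (design) times
construction_months = [6, 7, 8]  # possible stage 2 (construction) times

# Each sample point is a (design, construction) pair.
outcomes = list(product(design_months, construction_months))
completion_time = {point: point[0] + point[1] for point in outcomes}
on_time = [point for point in outcomes if completion_time[point] <= 10]

print(len(outcomes))                   # 9
print(min(completion_time.values()),
      max(completion_time.values()))   # 8 12
print(on_time)  # [(2, 6), (2, 7), (2, 8), (3, 6), (3, 7), (4, 6)]
```

Counting six of nine outcomes does not yet give a probability of 6/9, because nothing so far says the outcomes are equally likely.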
TABLE 4.1 Experimental Outcomes (Sample Points) for the KP&L Project 


Completion Time (months) 
Stage 1     Stage 2          Notation for             Total Project 
Design      Construction     Experimental Outcome     Completion Time (months) 

2           6                (2, 6)                    8 
2           7                (2, 7)                    9 
2           8                (2, 8)                   10 
3           6                (3, 6)                    9 
3           7                (3, 7)                   10 
3           8                (3, 8)                   11 
4           6                (4, 6)                   10 
4           7                (4, 7)                   11 
4           8                (4, 8)                   12 


FIGURE 4.3 Tree Diagram for the KP&L Project 

[Figure: a tree in which step 1 (design: 2, 3, or 4 months) branches into step 2 
(construction: 6, 7, or 8 months), ending in nine experimental outcomes (sample points) 
with their total project completion times: (2, 6) 8 months; (2, 7) 9 months; (2, 8) 10 months; 
(3, 6) 9 months; (3, 7) 10 months; (3, 8) 11 months; (4, 6) 10 months; (4, 7) 11 months; 
(4, 8) 12 months.] 


Combinations A second useful counting rule allows one to count the number of experi- 
mental outcomes when the experiment involves selecting n objects from a set of N objects. 
It is called the counting rule for combinations. 

The notation ! means factorial; for example, 5 factorial is 5! = (5)(4)(3)(2)(1) = 120. 

In sampling from a finite population of size N, the counting rule for combinations is 
used to find the number of different samples of size n that can be selected. 

COUNTING RULE FOR COMBINATIONS 


The number of combinations of N objects taken n at a time is 

$$C_n^N = \binom{N}{n} = \frac{N!}{n!(N - n)!} \qquad (4.1)$$ 

where 

$$N! = N(N - 1)(N - 2) \cdots (2)(1)$$ 

$$n! = n(n - 1)(n - 2) \cdots (2)(1)$$ 

and, by definition, $0! = 1$. 


As an illustration of the counting rule for combinations, consider a quality control 
procedure in which an inspector randomly selects two of five parts to test for defects. In 
a group of five parts, how many combinations of two parts can be selected? The counting 
rule in equation (4.1) shows that with N = 5 and n = 2, we have 


$$C_2^5 = \binom{5}{2} = \frac{5!}{2!(5 - 2)!} = \frac{(5)(4)(3)(2)(1)}{(2)(1)(3)(2)(1)} = \frac{120}{12} = 10$$ 

Thus, 10 outcomes are possible for the experiment of randomly selecting two parts from a 
group of five. If we label the five parts as A, B, C, D, and E, the 10 combinations or experi- 
mental outcomes can be identified as AB, AC, AD, AE, BC, BD, BE, CD, CE, and DE. 

As another example, consider that the Florida lottery system uses the random selection 
of 6 integers from a group of 53 to determine the weekly winner. The counting rule for 
combinations, equation (4.1), can be used to determine the number of ways 6 different 
integers can be selected from a group of 53. 


$$\binom{53}{6} = \frac{53!}{6!(53 - 6)!} = \frac{53!}{6!\,47!} = \frac{(53)(52)(51)(50)(49)(48)}{(6)(5)(4)(3)(2)(1)} = 22{,}957{,}480$$ 


The counting rule for combinations tells us that almost 23 million experimental outcomes 
are possible in the lottery drawing. An individual who buys a lottery ticket has 1 chance in 
22,957,480 of winning. 
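
Both combination examples can be reproduced in a few lines. The sketch below assumes Python 3.8 or later, where `math.comb` is available, and the helper name `n_combinations` is an illustrative choice. It applies equation (4.1) directly, lists the 10 pairs of parts, and evaluates the lottery count.

```python
from itertools import combinations
from math import comb, factorial

def n_combinations(N, n):
    """Equation (4.1): N! / (n! (N - n)!)."""
    return factorial(N) // (factorial(n) * factorial(N - n))

# Quality control example: choose 2 of the 5 parts A through E.
pairs = ["".join(c) for c in combinations("ABCDE", 2)]
print(pairs)  # ['AB', 'AC', 'AD', 'AE', 'BC', 'BD', 'BE', 'CD', 'CE', 'DE']
print(n_combinations(5, 2))  # 10

# Lottery example: choose 6 integers from 53.
lottery_outcomes = comb(53, 6)
print(lottery_outcomes)      # 22957480
print(1 / lottery_outcomes)  # chance that a single ticket wins
```

Order does not matter here, so AB and BA are counted as the same selection.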


Permutations A third counting rule that is sometimes useful is the counting rule for 
permutations. It allows one to compute the number of experimental outcomes when 
n objects are to be selected from a set of N objects where the order of selection is 
important. The same n objects selected in a different order are considered a different 
experimental outcome. 


COUNTING RULE FOR PERMUTATIONS 


The number of permutations of N objects taken n at a time is given by 


$$P_n^N = n!\binom{N}{n} = \frac{N!}{(N - n)!} \qquad (4.2)$$ 


The counting rule for permutations closely relates to the one for combinations; 
however, an experiment results in more permutations than combinations for the 
same number of objects because every selection of n objects can be ordered in n! 
different ways. 
As an example, consider again the quality control process in which an inspector selects 
two of five parts to inspect for defects. How many permutations may be selected? The 
counting rule in equation (4.2) shows that with N = 5 and n = 2, we have 


$$P_2^5 = \frac{5!}{(5 - 2)!} = \frac{5!}{3!} = \frac{(5)(4)(3)(2)(1)}{(3)(2)(1)} = \frac{120}{6} = 20$$ 


Thus, 20 outcomes are possible for the experiment of randomly selecting two parts from a 
group of five when the order of selection must be taken into account. If we label the parts 
A, B, C, D, and E, the 20 permutations are AB, BA, AC, CA, AD, DA, AE, EA, BC, CB, 
BD, DB, BE, EB, CD, DC, CE, EC, DE, and ED. 
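
The link between the two rules can be confirmed numerically: every combination of n objects can be ordered in n! ways. This is a brief sketch, again assuming Python 3.8 or later and an illustrative helper name `n_permutations`.

```python
from itertools import permutations
from math import comb, factorial

def n_permutations(N, n):
    """Equation (4.2): N! / (N - n)!."""
    return factorial(N) // factorial(N - n)

# Ordered selections of 2 parts from A through E.
ordered_pairs = ["".join(p) for p in permutations("ABCDE", 2)]

print(len(ordered_pairs))    # 20
print(n_permutations(5, 2))  # 20
# Each of the 10 combinations can be ordered in 2! = 2 ways.
print(n_permutations(5, 2) == factorial(2) * comb(5, 2))  # True
```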


Assigning Probabilities 

Now let us see how probabilities can be assigned to experimental outcomes. The three ap- 
proaches most frequently used are the classical, relative frequency, and subjective methods. 
Regardless of the method used, two basic requirements for assigning probabilities must 
be met. 


BASIC REQUIREMENTS FOR ASSIGNING PROBABILITIES 


1. The probability assigned to each experimental outcome must be between 0 and 
1, inclusive. If we let $E_i$ denote the ith experimental outcome and $P(E_i)$ its 
probability, then this requirement can be written as 

$$0 \le P(E_i) \le 1 \text{ for all } i \qquad (4.3)$$ 

2. The sum of the probabilities for all the experimental outcomes must equal 1.0. 
For n experimental outcomes, this requirement can be written as 

$$P(E_1) + P(E_2) + \cdots + P(E_n) = 1 \qquad (4.4)$$ 


The classical method of assigning probabilities is appropriate when all the experimen- 
tal outcomes are equally likely. If n experimental outcomes are possible, a probability of 
1/n is assigned to each experimental outcome. When using this approach, the two basic 
requirements for assigning probabilities are automatically satisfied. 

For an example, consider the experiment of tossing a fair coin; the two experimental 
outcomes—head and tail—are equally likely. Because one of the two equally likely out- 
comes is a head, the probability of observing a head is 1/2, or .50. Similarly, the proba- 
bility of observing a tail is also 1/2, or .50. 

As another example, consider the experiment of rolling a die. It would seem reasonable 
to conclude that the six possible outcomes are equally likely, and hence each outcome is 
assigned a probability of 1/6. If P(1) denotes the probability that one dot appears on the 
upward face of the die, then P(1) = 1/6. Similarly, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6, 
P(5) = 1/6, and P(6) = 1/6. Note that these probabilities satisfy the two basic requirements 
of equations (4.3) and (4.4) because each of the probabilities is greater than or equal to 
zero and they sum to 1.0. 
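
As a sketch of the classical method, the code below assigns 1/n to each outcome and then checks the two basic requirements. It assumes Python and uses exact fractions to avoid rounding; the function names are illustrative, not taken from the text.

```python
from fractions import Fraction

def classical_probabilities(outcomes):
    """Assign probability 1/n to each of n equally likely outcomes."""
    n = len(outcomes)
    return {outcome: Fraction(1, n) for outcome in outcomes}

def satisfies_basic_requirements(probs):
    """Check equations (4.3) and (4.4) for a probability assignment."""
    in_range = all(0 <= p <= 1 for p in probs.values())
    sums_to_one = sum(probs.values()) == 1
    return in_range and sums_to_one

die = classical_probabilities([1, 2, 3, 4, 5, 6])
print(die[1])                             # 1/6
print(satisfies_basic_requirements(die))  # True
```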

The relative frequency method of assigning probabilities is appropriate when data 
are available to estimate the proportion of the time the experimental outcome will occur 
if the experiment is repeated a large number of times. As an example, consider a study of 
waiting times in the X-ray department for a local hospital. A clerk recorded the num- 
ber of patients waiting for service at 9:00 a.m. on 20 successive days and obtained the 
following results. 

Number Waiting     Number of Days Outcome Occurred 
0                   2 
1                   5 
2                   6 
3                   4 
4                   3 
Total              20 


These data show that on 2 of the 20 days, zero patients were waiting for service; on 5 
of the days, one patient was waiting for service; and so on. Using the relative frequency 
method, we would assign a probability of 2/20 = .10 to the experimental outcome of zero 
patients waiting for service, 5/20 = .25 to the experimental outcome of one patient waiting, 
6/20 = .30 to two patients waiting, 4/20 = .20 to three patients waiting, and 3/20 = .15 to 
four patients waiting. As with the classical method, using the relative frequency method 
automatically satisfies the two basic requirements of equations (4.3) and (4.4). 

Bayes’ theorem (see Section 4.5) provides a means for combining subjectively determined 
prior probabilities with probabilities obtained by other means to obtain revised, or 
posterior, probabilities. 
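
The relative frequency calculation for the X-ray waiting data is a division of each count by the total. The sketch below assumes Python, with counts matching the fractions 2/20, 5/20, 6/20, 4/20, and 3/20 used in the discussion; the names are illustrative.

```python
from fractions import Fraction

# Number of patients waiting -> number of days (out of 20) that outcome occurred.
days_observed = {0: 2, 1: 5, 2: 6, 3: 4, 4: 3}

def relative_frequencies(counts):
    """Relative frequency method: each count divided by the total count."""
    total = sum(counts.values())
    return {outcome: Fraction(count, total) for outcome, count in counts.items()}

probs = relative_frequencies(days_observed)
for waiting, p in probs.items():
    print(waiting, float(p))

# Dividing by the total guarantees the probabilities sum to 1.
print(sum(probs.values()) == 1)  # True
```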

The subjective method of assigning probabilities is most appropriate when one cannot 
realistically assume that the experimental outcomes are equally likely and when little rel- 
evant data are available. When the subjective method is used to assign probabilities to the 
experimental outcomes, we may use any information available, such as our experience or 
intuition. After considering all available information, a probability value that expresses our 
degree of belief (on a scale from 0 to 1) that the experimental outcome will occur is spec- 
ified. Because subjective probability expresses a person’s degree of belief, it is personal. 
Using the subjective method, different people can be expected to assign different proba- 
bilities to the same experimental outcome. 

The subjective method requires extra care to ensure that the two basic requirements of 
equations (4.3) and (4.4) are satisfied. Regardless of a person’s degree of belief, the proba- 
bility value assigned to each experimental outcome must be between 0 and 1, inclusive, 
and the sum of all the probabilities for the experimental outcomes must equal 1.0. 

Consider the case in which Tom and Judy Elsbernd make an offer to purchase a house. 
Two outcomes are possible: 


$E_1$ = their offer is accepted 
$E_2$ = their offer is rejected 


Judy believes that the probability their offer will be accepted is .8; thus, Judy would set 
$P(E_1)$ = .8 and $P(E_2)$ = .2. Tom, however, believes that the probability that their offer will 
be accepted is .6; hence, Tom would set $P(E_1)$ = .6 and $P(E_2)$ = .4. Note that Tom’s 
probability estimate for $E_1$ reflects a greater pessimism that their offer will be accepted. 

Both Judy and Tom assigned probabilities that satisfy the two basic requirements. The 
fact that their probability estimates are different emphasizes the personal nature of the 
subjective method. 
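
Because subjective probabilities are entered by hand, it can help to check them against the two basic requirements. The following is a minimal sketch, assuming Python; the function name and the final inconsistent assignment are hypothetical and serve only to show a failing case.

```python
from math import isclose

def is_valid_assignment(probs, tol=1e-9):
    """Check that every value lies in [0, 1] and that the values sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probs.values())
    return in_range and isclose(sum(probs.values()), 1.0, abs_tol=tol)

judy = {"accepted": 0.8, "rejected": 0.2}
tom = {"accepted": 0.6, "rejected": 0.4}

print(is_valid_assignment(judy), is_valid_assignment(tom))  # True True
# Hypothetical inconsistent beliefs: the values sum to 1.2, so they are not valid.
print(is_valid_assignment({"accepted": 0.8, "rejected": 0.4}))  # False
```

A small tolerance is used because decimal values such as .1 are not stored exactly in floating point.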

Even in business situations where either the classical or the relative frequency approach 
can be applied, managers may want to provide subjective probability estimates. In such 
cases, the best probability estimates often are obtained by combining the estimates from 
the classical or relative frequency approach with subjective probability estimates. 


Probabilities for the KP&L Project 


To perform further analysis on the KP&L project, we must develop probabilities for each 
of the nine experimental outcomes listed in Table 4.1. On the basis of experience and 
judgment, management concluded that the experimental outcomes were not equally likely. 
Hence, the classical method of assigning probabilities could not be used. Management 
then decided to conduct a study of the completion times for similar projects undertaken by 
KP&L over the past three years. The results of a study of 40 similar projects are summarized 
in Table 4.2. 


TABLE 4.2 Completion Results for 40 KP&L Projects 


Completion Time (months)                      Number of Past Projects 
Stage 1     Stage 2                           Having These 
Design      Construction     Sample Point     Completion Times 
2           6                (2, 6)            6 
2           7                (2, 7)            6 
2           8                (2, 8)            2 
3           6                (3, 6)            4 
3           7                (3, 7)            8 
3           8                (3, 8)            2 
4           6                (4, 6)            2 
4           7                (4, 7)            4 
4           8                (4, 8)            6 
                             Total            40 

After reviewing the results of the study, management decided to employ the relative 
frequency method of assigning probabilities. Management could have provided subjective 
probability estimates but felt that the current project was quite similar to the 40 previous 
projects. Thus, the relative frequency method was judged best. 

In using the data in Table 4.2 to compute probabilities, we note that outcome (2, 6)—stage 
1 completed in 2 months and stage 2 completed in 6 months—occurred six times in the 40 
projects. We can use the relative frequency method to assign a probability of 6/40 = .15 to 
this outcome. Similarly, outcome (2, 7) also occurred in six of the 40 projects, providing a 
6/40 = .15 probability. Continuing in this manner, we obtain the probability assignments for 
the sample points of the KP&L project shown in Table 4.3. Note that P(2, 6) represents the 
probability of the sample point (2, 6), P(2, 7) represents the probability of the sample point 
(2, 7), and so on. 


TABLE 4.3 Probability Assignments for the KP&L Project Based 
on the Relative Frequency Method 


                 Project               Probability 
Sample Point     Completion Time       of Sample Point 

(2, 6)            8 months             P(2, 6) = 6/40 = .15 
(2, 7)            9 months             P(2, 7) = 6/40 = .15 
(2, 8)           10 months             P(2, 8) = 2/40 = .05 
(3, 6)            9 months             P(3, 6) = 4/40 = .10 
(3, 7)           10 months             P(3, 7) = 8/40 = .20 
(3, 8)           11 months             P(3, 8) = 2/40 = .05 
(4, 6)           10 months             P(4, 6) = 2/40 = .05 
(4, 7)           11 months             P(4, 7) = 4/40 = .10 
(4, 8)           12 months             P(4, 8) = 6/40 = .15 

                                       Total 1.00 
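
The probability column of Table 4.3 follows mechanically from the project counts. As a sketch, assuming Python and counts that match the 6/40, 2/40, 4/40, and 8/40 fractions shown in the table, the probabilities can be derived as follows.

```python
from fractions import Fraction

# Sample point (design, construction) -> number of past projects with those times.
past_projects = {
    (2, 6): 6, (2, 7): 6, (2, 8): 2,
    (3, 6): 4, (3, 7): 8, (3, 8): 2,
    (4, 6): 2, (4, 7): 4, (4, 8): 6,
}

total_projects = sum(past_projects.values())  # 40
probability = {
    point: Fraction(count, total_projects)
    for point, count in past_projects.items()
}

for (design, construction), p in probability.items():
    months = design + construction
    print(f"({design}, {construction})  {months} months  {float(p):.2f}")
```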

NOTES + COMMENTS 


1. In statistics, the notion of an experiment differs somewhat from the notion of an 
experiment in the physical sciences. In the physical sciences, researchers usually conduct 
an experiment in a laboratory or a controlled environment in order to learn about cause 
and effect. In statistical experiments, probability determines outcomes. Even though the 
experiment is repeated in exactly the same way, an entirely different outcome may occur. 
Because of this influence of probability on the outcome, the experiments of statistics 
are sometimes called random experiments. 

2. When drawing a random sample without replacement from a population of size N, the 
counting rule for combinations is used to find the number of different samples of size n 
that can be selected. 

EXERCISES 


Methods 


1. An experiment has three steps with three outcomes possible for the first step, two 
outcomes possible for the second step, and four outcomes possible for the third step. How 
many experimental outcomes exist for the entire experiment? 

2. How many ways can three items be selected from a group of six items? Use the letters 
A, B, C, D, E, and F to identify the items, and list each of the different combinations 
of three items. 

3. How many permutations of three items can be selected from a group of six? Use the letters 
A, B, C, D, E, and F to identify the items, and list each of the permutations of items B, D, 
and F. 

4. Consider the experiment of tossing a coin three times. 

a. Develop a tree diagram for the experiment. 

b. List the experimental outcomes. 

c. What is the probability for each experimental outcome? 

5. Suppose an experiment has five equally likely outcomes: $E_1$, $E_2$, $E_3$, $E_4$, $E_5$. Assign 
probabilities to each outcome and show that the requirements in equations (4.3) and 
(4.4) are satisfied. What method did you use? 

6. An experiment with three outcomes has been repeated 50 times, and it was learned 
that $E_1$ occurred 20 times, $E_2$ occurred 13 times, and $E_3$ occurred 17 times. Assign 
probabilities to the outcomes. What method did you use? 

7. A decision maker subjectively assigned the following probabilities to the four 
outcomes of an experiment: $P(E_1)$ = .10, $P(E_2)$ = .15, $P(E_3)$ = .40, and $P(E_4)$ = .20. Are 
these probability assignments valid? Explain. 


Applications 


8. Zoning Changes. In the city of Milford, applications for zoning changes go through 
a two-step process: a review by the planning commission and a final decision by the 
city council. At step 1 the planning commission reviews the zoning change request and 
makes a positive or negative recommendation concerning the change. At step 2 the city 
council reviews the planning commission’s recommendation and then votes to approve 
or to disapprove the zoning change. Suppose the developer of an apartment complex 
submits an application for a zoning change. Consider the application process as an 
experiment. 

a. How many sample points are there for this experiment? List the sample points. 

b. Construct a tree diagram for the experiment. 

9. Sampling Bank Accounts. Simple random sampling uses a sample of size n from a 
population of size N to obtain data that can be used to make inferences about the 
characteristics of a population. Suppose that, from a population of 50 bank accounts, we 
want to take a random sample of four accounts in order to learn about the population. 
How many different random samples of four accounts are possible? 

10. Code Churn. Code Churn is a common metric used to measure the efficiency and 
productivity of software engineers and computer programmers. It’s usually measured 
as the percentage of a programmer’s code that must be edited over a short period of 
time. Programmers with higher rates of code churn must rewrite code more often 
because of errors and inefficient programming techniques. The following table displays 
sample information for 10 computer programmers. (DATA file: Codechurn) 


               Total Lines of Code     Number of Lines of Code 
Programmer     Written                 Requiring Edits 
Liwei          23,789                   4,589 
Andrew         17,962                   2,780 
Jaime          30,025                  12,080 
Sherae         26,050                   3,780 
Binny          19,586                   1,890 
Roger          24,786                   4,005 
Dong-Gil       24,030                  [illegible] 
Alex           14,780                   1,052 
Jay            30,875                   3,872 
Vivek          21,546                     425 


a. Use the data in the table above and the relative frequency method to determine 
probabilities that a randomly selected line of code will need to be edited for each 
programmer. 

b. If you randomly select a line of code from Liwei, what is the probability that the 
line of code will require editing? 

c. If you randomly select a line of code from Sherae, what is the probability that the 
line of code will not require editing? 

d. Which programmer has the lowest probability of a randomly selected line of code 
requiring editing? Which programmer has the highest probability of a randomly 
selected line of code requiring editing? 

11. Tri-State Smokers. A 2015 Gallup Poll of U.S. adults indicated that Kentucky is the 
state with the highest percentage of smokers (Gallup website). Consider the following 
example data from the Tri-State region, an area that comprises northern Kentucky, 
southeastern Indiana, and southwestern Ohio. 


State Smoker Non-Smoker 
Kentucky 47 176 
Indiana 32 134 
Ohio 39 182 
Total: 118 492 


a. Use the data to compute the probability that an adult in the Tri-State region smokes. 

b. What is the probability of an adult in each state of the Tri-State region being a 
smoker? Which state in the Tri-State region has the lowest probability of an adult 
being a smoker? 

12. Toothpaste Package Designs. A company that manufactures toothpaste is study- 
ing five different package designs. Assuming that one design is just as likely to be 
selected by a consumer as any other design, what selection probability would you 
assign to each of the package designs? In an actual experiment, 100 consumers 
were asked to pick the design they preferred. The following data were obtained. Do 
the data confirm the belief that one design is just as likely to be selected as 
another? Explain. 


             Number of 
Design       Times Preferred 
1             5 
2            15 
3            30 
4            40 
5            10 


13. Powerball Lottery. The Powerball lottery is played twice each week in 44 states, the 
District of Columbia, and the Virgin Islands. To play Powerball, a participant must 
purchase a $2 ticket, select five numbers from the digits 1 through 69, and then select a 
Powerball number from the digits 1 through 26. To determine the winning numbers for 
each game, lottery officials draw 5 white balls out of a drum of 69 white balls numbered 
1 through 69 and 1 red ball out of a drum of 26 red balls numbered 1 through 26. To 
win the Powerball jackpot, a participant’s numbers must match the numbers on the 
5 white balls in any order and must also match the number on the red Powerball. The 
numbers 4-8-19-27-34 with a Powerball number of 10 provided the record jackpot of 
$1.586 billion (Powerball website). 

a. How many Powerball lottery outcomes are possible? (Hint: Consider this a 
two-step random experiment. Select the 5 white ball numbers and then select the 
1 red Powerball number.) 

b. What is the probability that a $2 lottery ticket wins the Powerball lottery? 


4.2 Events and Their Probabilities 


In the introduction to this chapter we used the term event much as it would be used in 
everyday language. Then, in Section 4.1 we introduced the concept of an experiment and 
its associated experimental outcomes or sample points. Sample points and events provide 
the foundation for the study of probability. As a result, we must now introduce the formal 
definition of an event as it relates to sample points. Doing so will provide the basis for 
determining the probability of an event. 


EVENT 


An event is a collection of sample points. 


For an example, let us return to the KP&L project and assume that the project manager 
is interested in the event that the entire project can be completed in 10 months or less. 
Referring to Table 4.3, we see that six sample points—(2, 6), (2, 7), (2, 8), (3, 6), (3, 7), 
and (4, 6)—provide a project completion time of 10 months or less. Let C denote the event 
that the project is completed in 10 months or less; we write 


C = {(2, 6), (2, 7), (2, 8), (3, 6), (3, 7), (4, 6)} 


Event C is said to occur if any one of these six sample points appears as the experimental 
outcome. 

Other events that might be of interest to KP&L management include the following. 


L = The event that the project is completed in less than 10 months 
M = The event that the project is completed in more than 10 months 


Using the information in Table 4.3, we see that these events consist of the following sample 
points. 


L = {(2, 6), (2, 7), (3, 6)} 
M = {(3, 8), (4, 7), (4, 8)} 


A variety of additional events can be defined for the KP&L project, but in each case the 
event must be identified as a collection of sample points for the experiment. 

Given the probabilities of the sample points shown in Table 4.3, we can use the follow- 
ing definition to compute the probability of any event that KP&L management might want 
to consider. 


PROBABILITY OF AN EVENT 


The probability of any event is equal to the sum of the probabilities of the sample 
points in the event. 


Using this definition, we calculate the probability of a particular event by adding the 
probabilities of the sample points (experimental outcomes) that make up the event. We can 
now compute the probability that the project will take 10 months or less to complete. Be- 
cause this event is given by C = {(2, 6), (2, 7), (2, 8), (3, 6), (3, 7), (4, 6)}, the probability 
of event C, denoted P(C), is given by 


P(C) = P(2, 6) + P(2, 7) + P(2, 8) + P(3, 6) + P(3, 7) + P(4, 6) 

Referring to the sample point probabilities in Table 4.3, we have 

P(C) = .15 + .15 + .05 + .10 + .20 + .05 = .70 


Similarly, because the event that the project is completed in less than 10 months is given 
by L = {(2, 6), (2, 7), (3, 6)}, the probability of this event is given by 


P(L) = P(2, 6) + P(2, 7) + P(3, 6) 
= .15 + .15 + .10 = .40 


Finally, for the event that the project is completed in more than 10 months, we have M = 
{(3, 8), (4, 7), (4, 8)} and thus 


P(M) = P(3, 8) + P(4, 7) + P(4, 8) 
= .05 + .10 + .15 = .30 


Using these probability results, we can now tell KP&L management that there is a .70 
probability that the project will be completed in 10 months or less, a .40 probability that 
the project will be completed in less than 10 months, and a .30 probability that the project 
will be completed in more than 10 months. This procedure of computing event probabili- 
ties can be repeated for any event of interest to the KP&L management. 
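
The definition of the probability of an event translates directly into a sum over sample points. The sketch below assumes Python and writes each sample point probability as a fraction of the 40 past projects; the helper name `event_probability` is illustrative.

```python
from fractions import Fraction

# Sample point probabilities for the KP&L project, as counts out of 40 projects.
prob = {
    (2, 6): Fraction(6, 40), (2, 7): Fraction(6, 40), (2, 8): Fraction(2, 40),
    (3, 6): Fraction(4, 40), (3, 7): Fraction(8, 40), (3, 8): Fraction(2, 40),
    (4, 6): Fraction(2, 40), (4, 7): Fraction(4, 40), (4, 8): Fraction(6, 40),
}

def event_probability(event, probs):
    """Sum the probabilities of the sample points that make up the event."""
    return sum(probs[point] for point in event)

# Events defined by total completion time (design + construction months).
C = {point for point in prob if sum(point) <= 10}  # 10 months or less
L = {point for point in prob if sum(point) < 10}   # less than 10 months
M = {point for point in prob if sum(point) > 10}   # more than 10 months

print(event_probability(C, prob))  # 7/10
print(event_probability(L, prob))  # 2/5
print(event_probability(M, prob))  # 3/10
```

Defining each event by a condition on the sample points, rather than typing the points by hand, makes it easy to evaluate any other event of interest.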

Any time that we can identify all the sample points of an experiment and assign proba- 
bilities to each, we can compute the probability of an event using the definition. However, 
in many experiments the large number of sample points makes the identification of the 
sample points, as well as the determination of their associated probabilities, extremely 
cumbersome, if not impossible. In the remaining sections of this chapter, we present some 
basic probability relationships that can be used to compute the probability of an event with- 
out knowledge of all the sample point probabilities. 
NOTES + COMMENTS 


1. The sample space, S, is an event. Because it contains all the experimental outcomes, 
it has a probability of 1; that is, P(S) = 1. 

2. When the classical method is used to assign probabilities, the assumption is that the 
experimental outcomes are equally likely. In such cases, the probability of an event can 
be computed by counting the number of experimental outcomes in the event and dividing 
the result by the total number of experimental outcomes. 


EXERCISES 


Methods 
14. An experiment has four equally likely outcomes: E1, E2, E3, and E4. 

a. What is the probability that E2 occurs? 
b. What is the probability that any two of the outcomes occur (e.g., E1 or E3)? 
c. What is the probability that any three of the outcomes occur (e.g., E1 or E2 
or E4)? 


15. Consider the experiment of selecting a playing card from a deck of 52 playing cards. 
Each card corresponds to a sample point with a 1/52 probability. 

a. List the sample points in the event an ace is selected. 
b. List the sample points in the event a club is selected. 
c. List the sample points in the event a face card (jack, queen, or king) is selected. 
d. Find the probabilities associated with each of the events in parts (a), (b), 
and (c). 


16. Consider the experiment of rolling a pair of dice. Suppose that we are interested in the 
sum of the face values showing on the dice. 

a. How many sample points are possible? (Hint: Use the counting rule for multiple-step 
experiments.) 
b. List the sample points. 
c. What is the probability of obtaining a value of 7? 
d. What is the probability of obtaining a value of 9 or greater? 
e. Because each roll has six possible even values (2, 4, 6, 8, 10, and 12) and only five 
possible odd values (3, 5, 7, 9, and 11), the dice should show even values more 
often than odd values. Do you agree with this statement? Explain. 
f. What method did you use to assign the probabilities requested? 


Applications 
17. KP&L Project Over Budget. Refer to the KP&L sample points and sample point 
probabilities in Tables 4.2 and 4.3. 

a. The design stage (stage 1) will run over budget if it takes four months to complete. 
List the sample points in the event the design stage is over budget. 
b. What is the probability that the design stage is over budget? 
c. The construction stage (stage 2) will run over budget if it takes eight months to 
complete. List the sample points in the event the construction stage is 
over budget. 
d. What is the probability that the construction stage is over budget? 
e. What is the probability that both stages are over budget? 


18. Corporate Headquarters Locations. Each year Fortune magazine publishes 
an annual list of the 500 largest companies in the United States. The corporate 
headquarters for the 500 companies are located in 38 different states. The following 
table shows the eight states with the largest number of Fortune 500 companies 
(Money/CNN website). 


State         Number of Companies    State           Number of Companies 
California    53                     Ohio            28 
Illinois      32                     Pennsylvania    23 
New Jersey    21                     Texas           52 
New York      50                     Virginia        24 


Suppose one of the 500 companies is selected at random for a follow-up questionnaire. 

a. What is the probability that the company selected has its corporate headquarters in 
California? 

b. What is the probability that the company selected has its corporate headquarters in 
California, New York, or Texas? 

c. What is the probability that the company selected has its corporate headquarters in 
one of the eight states listed above? 

19. Impact of Global Warming. A CBS News/New York Times poll of 1,000 adults in the 
United States asked the question, “Do you think global warming will have an impact 
on you during your lifetime?” (CBS News website). Consider the responses by age 
groups shown below. 


              Age 
Response    18-29    30+ 
Yes           134    293 
No            131    432 
Unsure          2      8 


a. What is the probability that a respondent 18-29 years of age thinks that global 
warming will not have an impact during his/her lifetime? 

b. What is the probability that a respondent 30+ years of age thinks that global warm- 
ing will not have an impact during his/her lifetime? 

c. For a randomly selected respondent, what is the probability that a respondent 
answers yes? 

d. Based on the survey results, does there appear to be a difference between ages 
18-29 and 30+ regarding concern over global warming? 

20. Age of Financial Independence. Suppose that the following table represents a sample 
of 944 teenagers’ responses to the question, “When do you think you will become 
financially independent?” 


Age Financially Independent    Number of Responses 
16 to 20                       191 
21 to 24                       467 
25 to 27                       244 
28 or older                     42 


Consider the experiment of randomly selecting a teenager from the population of 

teenagers aged 14 to 18. 

a. Compute the probability of being financially independent for each of the four age 
categories. 

b. What is the probability of being financially independent before the age of 25? 
c. What is the probability of being financially independent after the age of 24? 
d. Do the probabilities suggest that the teenagers may be somewhat unrealistic in their 
expectations about when they will become financially independent? 

21. Fatal Collisions with a Fixed Object. The National Highway Traffic Safety 
Administration (NHTSA) collects traffic safety-related data for the U.S. Department of 
Transportation. According to NHTSA’s data, 10,426 fatal collisions in 2016 were the 
result of collisions with fixed objects (NHTSA website, https://www.careforcrashvictims 
.com/wp-content/uploads/2018/07/Traffic-Safety-Facts-2016_-Motor-Vehicle-Crash 
-Data-from-the-Fatality-Analysis-Reporting-System-FARS-and-the-General-Estimates 
-System-GES.pdf). The following table provides more information on these collisions. 


Fixed Object Involved in Collision Number of Collisions 
Pole/post 1,416 
Culvert/curb/ditch 2,516 
Shrubbery/tree 2,585 
Guardrail 896 
Embankment 947 
Bridge 231 
Other/unknown 1,835 


Assume that a collision will be randomly chosen from this population. 

a. What is the probability of a fatal collision with a pole or post? 

b. What is the probability of a fatal collision with a guardrail? 

c. What type of fixed object is least likely to be involved in a fatal collision? What is 
the probability associated with this type of fatal collision? 

d. What type of object is most likely to be involved in a fatal collision? What is the 
probability associated with this type of fatal collision? 


4.3 Some Basic Relationships of Probability 
Complement of an Event 


Given an event A, the complement of A is defined to be the event consisting of all sample points 
that are not in A. The complement of A is denoted by A^c. Figure 4.4 is a diagram, known as a 
Venn diagram, that illustrates the concept of a complement. The rectangular area represents 
the sample space for the experiment and as such contains all possible sample points. The circle 
represents event A and contains only the sample points that belong to A. The shaded region of 
the rectangle contains all sample points not in event A and is by definition the complement of A. 

In any probability application, either event A or its complement A^c must occur. Therefore, 
we have 


P(A) + P(A^c) = 1 


FIGURE 4.4 Complement of Event A Is Shaded 

[Venn diagram: rectangle for Sample Space S; circle for Event A; the shaded region outside the circle is the Complement of Event A] 
Solving for P(A), we obtain the following result. 


COMPUTING PROBABILITY USING THE COMPLEMENT 


P(A) = 1 - P(A^c)    (4.5) 


Equation (4.5) shows that the probability of an event A can be computed easily if the 
probability of its complement, P(A^c), is known. 

As an example, consider the case of a sales manager who, after reviewing sales reports, 
states that 80% of new customer contacts result in no sale. By allowing A to denote 
the event of a sale and A^c to denote the event of no sale, the manager is stating that 
P(A^c) = .80. Using equation (4.5), we see that 


P(A) = 1 - P(A^c) = 1 - .80 = .20 


We can conclude that a new customer contact has a .20 probability of resulting in a sale. 

In another example, a purchasing agent states a .90 probability that a supplier will send 
a shipment that is free of defective parts. Using the complement, we can conclude that 
there is a 1 - .90 = .10 probability that the shipment will contain defective parts. 


Addition Law 


The addition law is helpful when we are interested in knowing the probability that at least 
one of two events occurs. That is, with events A and B we are interested in knowing the 
probability that event A or event B or both will occur. 

Before we present the addition law, we need to discuss two concepts related to the com- 
bination of events: the union of events and the intersection of events. Given two events A 
and B, the union of A and B is defined as follows. 


UNION OF TWO EVENTS 


The union of A and B is the event containing all sample points belonging to A or B or 
both. The union is denoted by A ∪ B. 


The Venn diagram in Figure 4.5 depicts the union of events A and B. Note that the two 
circles contain all the sample points in event A as well as all the sample points in event B. 
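The fact that the circles overlap indicates that some sample points are contained in both 
A and B. 


[Figure 4.5: Venn diagram of the union of events A and B; rectangle for Sample Space S with overlapping circles for Event A and Event B] 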
FIGURE 4.6 Intersection of Events A and B Is Shaded 

[Venn diagram: rectangle for Sample Space S; overlapping circles for Event A and Event B, with the overlap shaded] 


The definition of the intersection of A and B follows. 


INTERSECTION OF TWO EVENTS 


Given two events A and B, the intersection of A and B is the event containing the 
sample points belonging to both A and B. The intersection is denoted by A ∩ B. 


The Venn diagram depicting the intersection of events A and B is shown in Figure 4.6. The 
area where the two circles overlap is the intersection; it contains the sample points that are 
in both A and B. 

Let us now continue with a discussion of the addition law. The addition law provides 
a way to compute the probability that event A or event B or both occur. In other words, the 
addition law is used to compute the probability of the union of two events. The addition 
law is written as follows. 


ADDITION LAW 


P(A ∪ B) = P(A) + P(B) - P(A ∩ B)    (4.6) 


To understand the addition law intuitively, note that the first two terms in the addition 
law, P(A) + P(B), account for all the sample points in A ∪ B. However, because the sample 
points in the intersection A ∩ B are in both A and B, when we compute P(A) + P(B), we 
are in effect counting each of the sample points in A ∩ B twice. We correct for this 
overcounting by subtracting P(A ∩ B). 

As an example of an application of the addition law, let us consider the case of a small 
assembly plant with 50 employees. Each worker is expected to complete work assignments 
on time and in such a way that the assembled product will pass a final inspection. On 
occasion, some of the workers fail to meet the performance standards by completing work 
late or assembling a defective product. At the end of a performance evaluation period, the 
production manager found that 5 of the 50 workers completed work late, 6 of the 50 work- 
ers assembled a defective product, and 2 of the 50 workers both completed work late and 
assembled a defective product. 

Let 


L = the event that the work is completed late 
D = the event that the assembled product is defective 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




The relative frequency information leads to the following probabilities. 


P(L) = 5/50 = .10 
P(D) = 6/50 = .12 
P(L ∩ D) = 2/50 = .04 


After reviewing the performance data, the production manager decided to assign a poor 
performance rating to any employee whose work was either late or defective; thus the 
event of interest is L ∪ D. What is the probability that the production manager assigned an 
employee a poor performance rating? 

Note that the probability question is about the union of two events. Specifically, we 
want to know P(L ∪ D). Using equation (4.6), we have 


P(L ∪ D) = P(L) + P(D) - P(L ∩ D) 

Knowing values for the three probabilities on the right side of this expression, we can write 

P(L ∪ D) = .10 + .12 - .04 = .18 


This calculation tells us that there is a .18 probability that a randomly selected employee 
received a poor performance rating. 
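
To see why subtracting the intersection avoids double counting, the plant example can be mirrored with explicit sets. The sketch below is an illustration only: the worker ID numbers are invented (the text gives only the counts 5, 6, and 2), the helper `prob` is hypothetical, and Python's `fractions.Fraction` is used so that the arithmetic is exact.

```python
from fractions import Fraction

N_WORKERS = 50

# Hypothetical worker IDs chosen only to match the counts in the example:
# 5 late, 6 defective, 2 in both groups (IDs 3 and 4).
late = {0, 1, 2, 3, 4}
defective = {3, 4, 5, 6, 7, 8}

def prob(event):
    """Relative frequency of an event among the 50 workers."""
    return Fraction(len(event), N_WORKERS)

p_l = prob(late)                 # 5/50
p_d = prob(defective)            # 6/50
p_both = prob(late & defective)  # 2/50

# Addition law: P(L or D) = P(L) + P(D) - P(L and D)
p_union_law = p_l + p_d - p_both

# Direct count of the union, for comparison
p_union_direct = prob(late | defective)

print(float(p_union_law), float(p_union_direct))
```

Both routes give the same value because the two workers who are in both groups are counted once in the set union and are subtracted once in the addition law.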

As another example of the addition law, consider a recent study conducted by the 
personnel manager of a major computer software company. The study showed that 30% 
of the employees who left the firm within two years did so primarily because they were 
dissatisfied with their salary, 20% left because they were dissatisfied with their work 
assignments, and 12% of the former employees indicated dissatisfaction with both their 
salary and their work assignments. What is the probability that an employee who leaves 
within two years does so because of dissatisfaction with salary, dissatisfaction with the 
work assignment, or both? 

Let 


S = the event that the employee leaves because of salary 
W = the event that the employee leaves because of work assignment 


We have P(S) = .30, P(W) = .20, and P(S ∩ W) = .12. Using equation (4.6), the addition 
law, we have 


P(S ∪ W) = P(S) + P(W) - P(S ∩ W) = .30 + .20 - .12 = .38 


We find a .38 probability that an employee leaves for salary or work assignment reasons. 
Before we conclude our discussion of the addition law, let us consider a special case that 
arises for mutually exclusive events. 


MUTUALLY EXCLUSIVE EVENTS 


Two events are said to be mutually exclusive if the events have no sample points in 
common. 


Events A and B are mutually exclusive if, when one event occurs, the other cannot occur. 
Thus, a requirement for A and B to be mutually exclusive is that their intersection must 
contain no sample points. The Venn diagram depicting two mutually exclusive events A and B is 
shown in Figure 4.7. In this case P(A ∩ B) = 0 and the addition law can be written as follows. 


ADDITION LAW FOR MUTUALLY EXCLUSIVE EVENTS 


P(A ∪ B) = P(A) + P(B) 
FIGURE 4.7 Mutually Exclusive Events 

[Venn diagram: rectangle for Sample Space S; separate, non-overlapping circles for Event A and Event B] 


EXERCISES 


Methods 
22. Suppose that we have a sample space with five equally likely experimental outcomes: 
E1, E2, E3, E4, E5. Let 


a. Find P(A), P(B), and P(C). 
b. Find P(A ∪ B). Are A and B mutually exclusive? 
c. Find A^c, C^c, P(A^c), and P(C^c). 
d. Find A ∪ B^c and P(A ∪ B^c). 
e. Find P(B ∪ C). 

23. Suppose that we have a sample space S = {E1, E2, E3, E4, E5, E6, E7}, where E1, 
E2, ..., E7 denote the sample points. The following probability assignments apply: 
P(E1) = .05, P(E2) = .20, P(E3) = .20, P(E4) = .25, P(E5) = .15, P(E6) = .10, and 
P(E7) = .05. Let 




A = {E,, Ey, Ee} 
B = {E, E, Ey} 
C = {E, E, Es E} 

a. Find P(A), P(B), and P(C). 

b. Find A ∪ B and P(A ∪ B). 

c. Find A ∩ B and P(A ∩ B). 

d. Are events A and C mutually exclusive? 

e. Find B^c and P(B^c). 

Applications 


24. Clarkson University Alumni Survey. Clarkson University surveyed alumni to learn 
more about what they think of Clarkson. One part of the survey asked respondents to 
indicate whether their overall experience at Clarkson fell short of expectations, met 
expectations, or surpassed expectations. The results showed that 4% of the respond- 
ents did not provide a response, 26% said that their experience fell short of expecta- 
tions, and 65% of the respondents said that their experience met expectations. 

a. If we chose an alumnus at random, what is the probability that the alumnus would 
say their experience surpassed expectations? 

b. If we chose an alumnus at random, what is the probability that the alumnus would 
say their experience met or surpassed expectations? 
25. Americans Using Facebook and LinkedIn. A 2018 Pew Research Center survey 
(Pew Research website) examined the use of social media platforms in the United 
States. The survey found that there is a .68 probability that a randomly selected 
American will use Facebook and a .25 probability that a randomly selected American 
will use LinkedIn. In addition, there is a .22 probability that a randomly selected 
American will use both Facebook and LinkedIn. 

a. What is the probability that a randomly selected American will use Facebook or 
LinkedIn? 

b. What is the probability that a randomly selected American will not use either social 
media platform? 

26. Morningstar Mutual Fund Ratings. Information about mutual funds provided by 
Morningstar Investment Research includes the type of mutual fund (Domestic Equity, 
International Equity, or Fixed Income) and the Morningstar rating for the fund. The 
rating is expressed from 1-star (lowest rating) to 5-star (highest rating). Suppose a 
sample of 25 mutual funds provided the following counts: 

• Sixteen mutual funds were Domestic Equity funds. 
• Thirteen mutual funds were rated 3-star or less. 
• Seven of the Domestic Equity funds were rated 4-star. 
• Two of the Domestic Equity funds were rated 5-star. 


Assume that one of these 25 mutual funds will be randomly selected in order to learn 

more about the mutual fund and its investment strategy. 

a. What is the probability of selecting a Domestic Equity fund? 

b. What is the probability of selecting a fund with a 4-star or 5-star rating? 

c. What is the probability of selecting a fund that is both a Domestic Equity fund and 
a fund with a 4-star or 5-star rating? 

d. What is the probability of selecting a fund that is a Domestic Equity fund or a fund 
with a 4-star or 5-star rating? 

27. Social Media Use. A marketing firm would like to test-market the name of a new 
energy drink targeted at 18- to 29-year-olds via social media. A 2015 study by the Pew 
Research Center found that 35% of U.S. adults (18 and older) do not use social media 
(Pew Research Center website). The percentage of U.S. adults age 30 and older is 
78%. Suppose that the percentage of the U.S. adult population that is either age 18-29 
or uses social media is 67.2%. 

a. What is the probability that a randomly selected U.S. adult uses social media? 

b. What is the probability that a randomly selected U.S. adult is aged 18-29? 

c. What is the probability that a randomly selected U.S. adult is 18-29 and a user of 
social media? 

28. Survey on Car Rentals. A survey of magazine subscribers showed that 45.8% rented 
a car during the past 12 months for business reasons, 54% rented a car during the past 
12 months for personal reasons, and 30% rented a car during the past 12 months for 
both business and personal reasons. 

a. What is the probability that a subscriber rented a car during the past 12 months for 
business or personal reasons? 

b. What is the probability that a subscriber did not rent a car during the past 12 
months for either business or personal reasons? 

29. Ivy League Admissions. High school seniors with strong academic records apply 
to the nation’s most selective colleges in greater numbers each year. Because the 
number of slots remains relatively stable, some colleges reject more early appli- 
cants. Suppose that for a recent admissions class, an Ivy League college received 
2,851 applications for early admission. Of this group, it admitted 1,033 students 
early, rejected 854 outright, and deferred 964 to the regular admission pool for fur- 
ther consideration. In the past, this school has admitted 18% of the deferred early 
admission applicants during the regular admission process. Counting the students 
admitted early and the students admitted during the regular admission process, the 
total class size was 2,375. Let E, R, and D represent the events that a student who 
applies for early admission is admitted early, rejected outright, or deferred to the 
regular admissions pool. 

a. Use the data to estimate P(E), P(R), and P(D). 

b. Are events E and D mutually exclusive? Find P(E ∩ D). 

c. For the 2,375 students who were admitted, what is the probability that a randomly 
selected student was accepted during early admission? 

d. Suppose a student applies for early admission. What is the probability that the 
student will be admitted for early admission or be deferred and later admitted dur- 
ing the regular admission process? 


4.4 Conditional Probability 


Often, the probability of an event is influenced by whether a related event already 
occurred. Suppose we have an event A with probability P(A). If we obtain new informa- 
tion and learn that a related event, denoted by B, already occurred, we will want to take 
advantage of this information by calculating a new probability for event A. This new 
probability of event A is called a conditional probability and is written P(A | B). We 
use the notation | to indicate that we are considering the probability of event A given the 
condition that event B has occurred. Hence, the notation P(A | B) reads “the probability 
of A given B.” 

As an illustration of the application of conditional probability, consider the situation 
of the promotion status of male and female officers of a major metropolitan police force 
in the eastern United States. The police force consists of 1,200 officers, 960 men and 
240 women. Over the past two years, 324 officers on the police force received promo- 
tions. The specific breakdown of promotions for male and female officers is shown in 
Table 4.4. 

After reviewing the promotion record, a committee of female officers raised a discrimi- 
nation case on the basis that 288 male officers had received promotions, but only 36 female 
officers had received promotions. The police administration argued that the relatively low 
number of promotions for female officers was due not to discrimination, but to the fact 
that relatively few females are members of the police force. Let us show how conditional 
probability could be used to analyze the discrimination charge. 

Let 


M = event an officer is a man 

W = event an officer is a woman 

A = event an officer is promoted 

A^c = event an officer is not promoted 


TABLE 4.4 Promotion Status of Police Officers Over the Past Two Years 

                  Men    Women    Total 
Promoted          288       36      324 
Not Promoted      672      204      876 
Total             960      240    1,200 


Dividing the data values in Table 4.4 by the total of 1,200 officers enables us to summarize 
the available information with the following probability values. 


P(M ∩ A) = 288/1,200 = .24      probability that a randomly selected officer 
                                is a man and is promoted 

P(M ∩ A^c) = 672/1,200 = .56    probability that a randomly selected officer 
                                is a man and is not promoted 

P(W ∩ A) = 36/1,200 = .03       probability that a randomly selected officer 
                                is a woman and is promoted 

P(W ∩ A^c) = 204/1,200 = .17    probability that a randomly selected officer 
                                is a woman and is not promoted 


TABLE 4.5 Joint Probability Table for Promotions 

                      Men (M)    Women (W)    Total 
Promoted (A)            .24         .03         .27 
Not Promoted (A^c)      .56         .17         .73 
Total                   .80         .20        1.00 

Joint probabilities appear in the body of the table. 
Marginal probabilities appear in the margins of the table. 


Because each of these values gives the probability of the intersection of two events, the 
probabilities are called joint probabilities. Table 4.5, which provides a summary of the 
probability information for the police officer promotion situation, is referred to as a joint 
probability table. 

The values in the margins of the joint probability table provide the probabilities of each 
event separately. That is, P(M) = .80, P(W) = .20, P(A) = .27, and P(A^c) = .73. These 
probabilities are referred to as marginal probabilities because of their location in the 
margins of the joint probability table. We note that the marginal probabilities are found by 
summing the joint probabilities in the corresponding row or column of the joint probability 
table. For instance, the marginal probability of being promoted is P(A) = P(M ∩ A) + 
P(W ∩ A) = .24 + .03 = .27. From the marginal probabilities, we see that 80% of 
the force is male, 20% of the force is female, 27% of all officers received promotions, and 
73% were not promoted. 
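
As a rough illustration of how the joint and marginal probabilities follow from the promotion counts, the sketch below divides each count by 1,200 and then sums the cells that share a label. It is written in Python rather than Excel, and the names `counts`, `joint`, and `marginal` are hypothetical.

```python
# Promotion counts from the police force example.
counts = {
    ("M", "A"): 288,   # men promoted
    ("M", "Ac"): 672,  # men not promoted
    ("W", "A"): 36,    # women promoted
    ("W", "Ac"): 204,  # women not promoted
}
total = sum(counts.values())  # 1,200 officers

# Joint probabilities: each cell count divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

def marginal(label):
    """Sum the joint probabilities of every cell that includes the label."""
    return round(sum(p for cell, p in joint.items() if label in cell), 2)

for label in ("M", "W", "A", "Ac"):
    print(label, marginal(label))
```

Each marginal probability is a row or column total of the joint table, which is why the four joint probabilities are enough to recover P(M), P(W), P(A), and the probability of the complement of A.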

Let us begin the conditional probability analysis by computing the probability 
that an officer is promoted given that the officer is a man. In conditional probability 
notation, we are attempting to determine P(A | M). To calculate P(A | M), we first 
realize that this notation simply means that we are considering the probability of the 
event A (promotion) given that the condition designated as event M (the officer is a 
man) is known to exist. Thus P(A | M) tells us that we are now concerned only with 
the promotion status of the 960 male officers. Because 288 of the 960 male officers 
received promotions, the probability of being promoted given that the officer is a man 
is 288/960 = .30. In other words, given that an officer is a man, that officer had a 30% 
chance of receiving a promotion over the past two years. 

This procedure was easy to apply because the values in Table 4.4 show the number of 
officers in each category. We now want to demonstrate how conditional probabilities such 
as P(A | M) can be computed directly from related event probabilities rather than the fre- 
quency data of Table 4.4. 

We have shown that P(A | M) = 288/960 = .30. Let us now divide both the numerator 
and denominator of this fraction by 1,200, the total number of officers in the study. 
P(A | M) = 288/960 = (288/1,200)/(960/1,200) = .24/.80 


We now see that the conditional probability P(A | M) can be computed as .24/.80. Refer 
to the joint probability table (Table 4.5). Note in particular that .24 is the joint probability 
of A and M, that is, P(A ∩ M) = .24. Also note that .80 is the marginal probability that a 
randomly selected officer is a man; that is, P(M) = .80. Thus, the conditional probability 
P(A | M) can be computed as the ratio of the joint probability P(A ∩ M) to the marginal 
probability P(M). 


P(A | M) = P(A ∩ M)/P(M) = .24/.80 = .30 


The fact that conditional probabilities can be computed as the ratio of a joint probability 
to a marginal probability provides the following general formula for conditional probability 
calculations for two events A and B. 


CONDITIONAL PROBABILITY 


P(A | B) = P(A ∩ B)/P(B)    (4.7) 
or 
P(B | A) = P(A ∩ B)/P(A)    (4.8) 


The Venn diagram in Figure 4.8 is helpful in obtaining an intuitive understanding of 
conditional probability. The circle on the right shows that event B has occurred; the portion 
of the circle that overlaps with event A denotes the event (A ∩ B). We know that once event 
B has occurred, the only way that we can also observe event A is for the event (A ∩ B) 
to occur. Thus, the ratio P(A ∩ B)/P(B) provides the conditional probability that we will 
observe event A given that event B has already occurred. 


FIGURE 4.8 Conditional Probability P(A | B) = P(A ∩ B)/P(B) 

[Venn diagram: overlapping circles for Event A and Event B; the overlap is Event A ∩ B] 


Let us return to the issue of discrimination against the female officers. The marginal 
probability in row 1 of Table 4.5 shows that the probability of promotion of an officer is 
P(A) = .27 (regardless of whether that officer is male or female). However, the critical 
issue in the discrimination case involves the two conditional probabilities P(A | M) and 
P(A | W). That is, what is the probability of a promotion given that the officer is a man, 
and what is the probability of a promotion given that the officer is a woman? If these two 
probabilities are equal, a discrimination argument has no basis because the chances of a 
promotion are the same for male and female officers. However, a difference in the two 
conditional probabilities will support the position that male and female officers are treated 
differently in promotion decisions. 

We already determined that P(A | M ) = .30. Let us now use the probability values in 
Table 4.5 and the basic relationship of conditional probability in equation (4.7) to com- 
pute the probability that an officer is promoted given that the officer is a woman; that is, 
P(A | W). Using equation (4.7), with W replacing B, we obtain 


P(A | W) = P(A ∩ W)/P(W) = .03/.20 = .15 


What conclusion do you draw? The probability of a promotion given that the officer is 
a man is .30, twice the .15 probability of a promotion given that the officer is a woman. 
Although the use of conditional probability does not in itself prove that discrimination 
exists in this case, the conditional probability values support the argument presented by the 
female officers. 
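
Equation (4.7) can be checked numerically. The following Python sketch is illustrative only: the function name `conditional` is hypothetical, and the inputs are the joint and marginal probabilities from the joint probability table for promotions.

```python
def conditional(p_joint, p_given):
    """Return P(A | B) = P(A and B) / P(B), as in equation (4.7)."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_given

# Joint and marginal probabilities from the joint probability table.
p_a_and_m, p_m = 0.24, 0.80  # promoted and a man; is a man
p_a_and_w, p_w = 0.03, 0.20  # promoted and a woman; is a woman

p_a_given_m = round(conditional(p_a_and_m, p_m), 2)
p_a_given_w = round(conditional(p_a_and_w, p_w), 2)

print(p_a_given_m, p_a_given_w)  # promotion probabilities for men and for women
```

The guard against a zero denominator reflects the fact that a conditional probability is defined only when the conditioning event can occur.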


Independent Events 


In the preceding illustration, P(A) = .27, P(A | M) = .30, and P(A | W) = .15. We see that 
the probability of a promotion (event A) is affected or influenced by whether the officer is a 
man or a woman. Particularly, because P(A | M) ≠ P(A), we would say that events A and M 
are dependent events. That is, the probability of event A (promotion) is altered or affected 
by knowing that event M (the officer is a man) exists. Similarly, with P(A | W) ≠ P(A), we 
would say that events A and W are dependent events. However, if the probability of event A 
is not changed by the existence of event M, that is, if P(A | M) = P(A), we would say that 
events A and M are independent events. This situation leads to the following definition of 
the independence of two events. 


INDEPENDENT EVENTS 


Two events A and B are independent if 
P(A | B) = P(A) (4.9) 
or 
P(B | A) = P(B) (4.10) 


Otherwise, the events are dependent. 
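
A small check of equation (4.9) on the promotion data is sketched below in Python. The helper name `is_independent` and the tolerance `tol` are assumptions introduced for illustration; some tolerance is needed in practice because reported probabilities are usually rounded.

```python
import math

def is_independent(p_a_given_b, p_a, tol=1e-9):
    """Apply equation (4.9): A and B are independent if P(A | B) equals P(A)."""
    return math.isclose(p_a_given_b, p_a, abs_tol=tol)

p_a = 0.27          # P(A): an officer is promoted
p_a_given_m = 0.30  # P(A | M)
p_a_given_w = 0.15  # P(A | W)

print(is_independent(p_a_given_m, p_a))  # events A and M
print(is_independent(p_a_given_w, p_a))  # events A and W
```

Both checks report that the events are not independent, which matches the conclusion that promotion and the officer's gender are dependent events in this example.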


Multiplication Law 


Whereas the addition law of probability is used to compute the probability of a union 
of two events, the multiplication law is used to compute the probability of the 
intersection of two events. The multiplication law is based on the definition of conditional 
probability. Using equations (4.7) and (4.8) and solving for P(A ∩ B), we obtain the 
multiplication law. 


MULTIPLICATION LAW 

P(A ∩ B) = P(B)P(A | B)    (4.11) 
or 
P(A ∩ B) = P(A)P(B | A)    (4.12) 
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To illustrate the use of the multiplication law, consider a telecommunications company 
that offers services such as high-speed Internet, cable television, and telephone services. For 
a particular city, it is known that 84% of the households subscribe to high-speed Internet 
service. If we let H denote the event that a household subscribes to high-speed Internet ser- 
vice, P(H) = .84. In addition, it is known that the probability that a household that already 
subscribes to high-speed Internet service also subscribes to cable television service (event 
C) is .75; that is, P(C | H) = .75. What is the probability that a household subscribes to both 
high-speed Internet and cable television services? Using the multiplication law, we compute 
the desired P(C ∩ H) as 


P(C ∩ H) = P(H)P(C | H) = .84(.75) = .63 


We now know that 63% of the households subscribe to both high-speed Internet and cable 
television services. 
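The same calculation can be written as a short Python sketch. The function name `joint_probability` and the variable names are illustrative assumptions; the probabilities are the ones given in the example.

```python
def joint_probability(p_b: float, p_a_given_b: float) -> float:
    """Multiplication law (4.11): P(A ∩ B) = P(B) P(A | B)."""
    return p_b * p_a_given_b

p_h = 0.84          # P(H): household subscribes to high-speed Internet
p_c_given_h = 0.75  # P(C | H): subscribes to cable television given Internet service

p_c_and_h = joint_probability(p_h, p_c_given_h)
print(round(p_c_and_h, 2))
```

The result is rounded for display only, because floating-point multiplication may differ from .63 in the last few digits.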

Before concluding this section, let us consider the special case of the multiplication 
law when the events involved are independent. Recall that events A and B are independent 
whenever P(A | B) = P(A) or P(B | A) = P(B). Hence, using equations (4.11) and (4.12) 
for the special case of independent events, we obtain the following multiplication law. 


MULTIPLICATION LAW FOR INDEPENDENT EVENTS 


P(A ∩ B) = P(A)P(B) (4.13) 


To compute the probability of the intersection of two independent events, we simply 
multiply the corresponding probabilities. Note that the multiplication law for independent 
events provides another way to determine whether A and B are independent. That is, if 
P(A ∩ B) = P(A)P(B), then A and B are independent; if P(A ∩ B) ≠ P(A)P(B), then A 
and B are dependent. 

As an application of the multiplication law for independent events, consider the situ- 
ation of a service station manager who knows from past experience that 80% of the cus- 
tomers use a credit card when they purchase gasoline. What is the probability that the next 
two customers purchasing gasoline will each use a credit card? If we let 


A = the event that the first customer uses a credit card 
B = the event that the second customer uses a credit card 


then the event of interest is A ∩ B. Given no other information, we can reasonably assume 
that A and B are independent events. Thus, 


P(A ∩ B) = P(A)P(B) = (.80)(.80) = .64 
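
As a rough illustration, the service station calculation and the product test for independence can be sketched in Python. The function names are illustrative assumptions, and the joint probability of .50 in the last line is a hypothetical value chosen only to show a failing test.

```python
import math

def joint_if_independent(p_a: float, p_b: float) -> float:
    """Multiplication law for independent events (4.13): P(A ∩ B) = P(A) P(B)."""
    return p_a * p_b

def passes_product_test(p_a_and_b: float, p_a: float, p_b: float) -> bool:
    """True when P(A ∩ B) equals P(A) P(B) up to floating-point rounding."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=1e-9)

# Service station example: each customer uses a credit card with probability .80.
p_both = joint_if_independent(0.80, 0.80)
print(round(p_both, 2))

# Product test on the example value, then on a hypothetical joint probability of .50.
print(passes_product_test(0.64, 0.80, 0.80))
print(passes_product_test(0.50, 0.80, 0.80))
```

The first test succeeds because .64 is exactly the product of the two probabilities; the hypothetical value of .50 fails, which would indicate dependent events.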


To summarize this section, we note that our interest in conditional probability is motiv- 
ated by the fact that events are often related. In such cases, we say the events are dependent 
and the conditional probability formulas in equations (4.7) and (4.8) must be used to com- 
pute the event probabilities. If two events are not related, they are independent; in this case 
neither event’s probability is affected by whether the other event occurred. 


NOTES + COMMENTS 


Do not confuse the notion of mutually exclusive events with that of independent events. 
Two events with nonzero probabilities cannot be both mutually exclusive and independent. 
If one mutually exclusive event is known to occur, the other cannot occur; thus, the 
probability of the other event occurring is reduced to zero. They are therefore dependent. 

EXERCISES 


Methods 
30. Suppose that we have two events, A and B, with P(A) = .50, P(B) = .60, and 
P(A ∩ B) = .40. 


a. Find P(A | B). 

b. Find P(B | A). 

c. Are A and B independent? Why or why not? 

31. Assume that we have two events, A and B, that are mutually exclusive. Assume further 
that we know P(A) = .30 and P(B) = .40. 

a. What is P(A ∩ B)? 

b. What is P(A | B)? 

c. A student in statistics argues that the concepts of mutually exclusive events and 
independent events are really the same, and that if events are mutually exclusive 
they must be independent. Do you agree with this statement? Use the probability 
information in this problem to justify your answer. 

d. What general conclusion would you make about mutually exclusive and independ- 
ent events given the results of this problem? 


Applications 

32. Living with Family. Consider the following example survey results of 18- to 34-year- 
olds in the United States, in response to the question “Are you currently living with your 
family?”: 


Yes No Totals 
Men 247 
Women 253 
Totals 198 302 500 


a. Develop the joint probability table for these data and use it to answer the follow- 
ing questions. 

b. What are the marginal probabilities? 

c. What is the probability of living with family given you are an 18- to 34-year-old 
man in the United States? 

d. What is the probability of living with family given you are an 18- to 34-year-old 
woman in the United States? 

e. What is the probability of an 18- to 34-year-old in the United States living with 
family? 

f. If, in the United States, 49.4% of 18- to 34-year-olds are male, do you consider this 
a good representative sample? Why? 

33. Intent to Pursue MBA. Students taking the Graduate Management Admissions Test 
(GMAT) were asked about their undergraduate major and intent to pursue their MBA 
as a full-time or part-time student. A summary of their responses follows. 


Undergraduate Major 


Business Engineering Other Totals 
Intended Full-Time 800 
Enrollment Part-Time 505 
Status Totals 502 358 445 1,305 

a. Develop a joint probability table for these data. 

b. Use the marginal probabilities of undergraduate major (business, engineering, or 
other) to comment on which undergraduate major produces the most potential MBA 
students. 

c. If a student intends to attend classes full-time in pursuit of an MBA degree, what is 
the probability that the student was an undergraduate engineering major? 

d. If a student was an undergraduate business major, what is the probability that the 
student intends to attend classes full-time in pursuit of an MBA degree? 

e. Let F denote the event that the student intends to attend classes full-time in pursuit 
of an MBA degree, and let B denote the event that the student was an undergraduate 
business major. Are events F and B independent? Justify your answer. 

34. On-Time Performance of Airlines. The Bureau of Transportation Statistics reports on- 
time performance for airlines at major U.S. airports. JetBlue, United, and US Airways 
share Terminal C at Boston’s Logan Airport. The percentage of on-time flights reported 
for a sample month were 76.8% for JetBlue, 71.5% for United, and 82.2% for US Air- 
ways. Assume that 30% of the flights arriving at Terminal C are JetBlue flights, 32% are 
United flights, and 38% US Airways flights. 

a. Develop a joint probability table with three rows (the airlines) and two columns 
(on-time and late). 

b. An announcement is made that Flight 1382 will be arriving at gate 20 of Terminal 
C. What is the probability that Flight 1382 will arrive on time? 

c. What is the most likely airline for Flight 1382? What is the probability that Flight 
1382 is by this airline? 

d. Suppose that an announcement is made saying that Flight 1382 will now be arriving 
late. What is the most likely airline for this flight? What is the probability that 
Flight 1382 is by this airline? 

35. Better at Getting Deals. To better understand how husbands and wives feel about 
their finances, Money magazine conducted a national poll of 1,010 married adults age 
25 and older with household incomes of $50,000 or more (Money website). Consider 
the following example set of responses to the question, “Who is better at getting 
deals?” 


                 Who Is Better? 

Respondent    I Am    My Spouse    We Are Equal 
Husband       278     127          102 
Wife          290     111          102 


a. Develop a joint probability table and use it to answer the following questions. 

b. Construct the marginal probabilities for Who Is Better (I Am, My Spouse, We Are 
Equal). Comment. 

c. Given that the respondent is a husband, what is the probability that he feels he is 
better at getting deals than his wife? 

d. Given that the respondent is a wife, what is the probability that she feels she is 
better at getting deals than her husband? 

e. Given a response “My spouse” is better at getting deals, what is the probability that 
the response came from a husband? 

f. Given a response “We are equal,” what is the probability that the response came 
from a husband? What is the probability that the response came from a wife? 

36. NBA Free Throws. Suppose that a particular NBA player makes 93% of his free 
throws. Assume that late in a basketball game, this player is fouled and is awarded two 
free throws. 

a. What is the probability that he will make both free throws? 

b. What is the probability that he will make at least one free throw? 

c. What is the probability that he will miss both free throws? 

d. Late in a basketball game, a team often intentionally fouls an opposing player in 
order to stop the game clock. The usual strategy is to intentionally foul the other 
team’s worst free-throw shooter. Assume that the team’s worst free throw shooter 
makes 58% of his free-throw shots. Calculate the probabilities for this player as 
shown in parts (a), (b), and (c), and show that intentionally fouling this player who 
makes 58% of his free throws is a better strategy than intentionally fouling the 
player who makes 93% of his free throws. Assume as in parts (a), (b), and (c) that 
two free throws will be awarded. 

37. Giving Up Electronics. A 2018 Pew Research Center survey found that more 
Americans believe they could give up their televisions than could give up their cell 
phones (Pew Research website). Assume that the following table represents the joint 
probabilities of Americans who could give up their television or cell phone. 


                        Could Give Up Television 
                        Yes      No 
Could Give Up    Yes                      .48 
Cellphone        No                       .52 


a. What is the probability that a person could give up her cell phone? 

b. What is the probability that a person who could give up her cell phone could also 
give up television? 

c. What is the probability that a person who could not give up her cell phone could 
give up television? 

d. Is the probability a person could give up television higher if the person could not 
give up a cell phone or if the person could give up a cell phone? 

38. Payback of Student Loans. The Institute for Higher Education Policy, a 
Washington, DC-based research firm, studied the payback of student loans 
for 1.8 million college students who had student loans that began to become 
due six years prior (The Wall Street Journal). The study found that 50% of the 
student loans were being paid back in a satisfactory fashion, and 50% of the 
student loans were delinquent. The following joint probability table shows 
the probabilities of a student’s loan status and whether or not the student had 
received a college degree. 


                        College Degree 
                        Yes      No 
Loan      Satisfactory                    .50 
Status    Delinquent                      .50 

a. What is the probability that a student with a student loan had received a college 
degree? 

b. What is the probability that a student with a student loan had not received a 
college degree? 

c. Given the student had received a college degree, what is the probability that the 
student has a delinquent loan? 

d. Given the student had not received a college degree, what is the probability that the 
student has a delinquent loan? 

e. What is the impact of dropping out of college without a degree for students who 
have a student loan? 


4.5 Bayes’ Theorem 


In the discussion of conditional probability, we indicated that revising probabilities when 
new information is obtained is an important phase of probability analysis. Often, we begin 
the analysis with initial or prior probability estimates for specific events of interest. Then, 
from sources such as a sample, a special report, or a product test, we obtain additional 
information about the events. Given this new information, we update the prior probability 
values by calculating revised probabilities, referred to as posterior probabilities. Bayes’ 
theorem provides a means for making these probability calculations. The steps in this 
probability revision process are shown in Figure 4.9. 

As an application of Bayes’ theorem, consider a manufacturing firm that receives 
shipments of parts from two different suppliers. Let A₁ denote the event that a part is from 
supplier 1 and A₂ denote the event that a part is from supplier 2. Currently, 65% of the parts 
purchased by the company are from supplier 1 and the remaining 35% are from supplier 2. 
Hence, if a part is selected at random, we would assign the prior probabilities P(A₁) = .65 
and P(A₂) = .35. 

The quality of the purchased parts varies with the source of supply. Historical data 
suggest that the quality ratings of the two suppliers are as shown in Table 4.6. If we let G 
denote the event that a part is good and B denote the event that a part is bad, the informa- 
tion in Table 4.6 provides the following conditional probability values. 


P(G | A₁) = .98    P(B | A₁) = .02 
P(G | A₂) = .95    P(B | A₂) = .05 


FIGURE 4.9 Probability Revision Using Bayes’ Theorem 

[Flow diagram: Prior Probabilities → New Information → Application of Bayes’ Theorem → Posterior Probabilities] 


TABLE 4.6 Historical Quality Levels of Two Suppliers 

              Percentage    Percentage 
              Good Parts    Bad Parts 
Supplier 1    98            2 
Supplier 2    95            5 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


The tree diagram in Figure 4.10 depicts the process of the firm receiving a part from 
one of the two suppliers and then discovering that the part is good or bad as a two-step 
experiment. We see that four experimental outcomes are possible; two correspond to the 
part being good and two correspond to the part being bad. 


FIGURE 4.10 Tree Diagram for Two-Supplier Example 

Step 1      Step 2       Experimental 
Supplier    Condition    Outcome 
A₁          G            (A₁, G) 
A₁          B            (A₁, B) 
A₂          G            (A₂, G) 
A₂          B            (A₂, B) 

Note: Step 1 shows that the part comes from one of two suppliers, 
and step 2 shows whether the part is good or bad. 

Each of the experimental outcomes is the intersection of two events, so we can use the 
multiplication rule to compute the probabilities. For instance, 


P(A₁, G) = P(A₁ ∩ G) = P(A₁)P(G | A₁) 


The process of computing these joint probabilities can be depicted in what is called a 
probability tree (see Figure 4.11). From left to right through the tree, the probabilities for 
each branch at step 1 are prior probabilities and the probabilities for each branch at step 2 are 
conditional probabilities. To find the probabilities of each experimental outcome, we simply 
multiply the probabilities on the branches leading to the outcome. Each of these joint proba- 
bilities is shown in Figure 4.11 along with the known probabilities for each branch. 
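
The multiplication along the branches can be sketched in Python as follows. The dictionary layout and the labels "A1", "A2", "G", and "B" are illustrative assumptions; the probabilities are the prior and conditional values given for the two suppliers.

```python
# Step 1 branches: prior probability that a part comes from each supplier.
priors = {"A1": 0.65, "A2": 0.35}

# Step 2 branches: probability of a good (G) or bad (B) part given the supplier.
conditionals = {
    "A1": {"G": 0.98, "B": 0.02},
    "A2": {"G": 0.95, "B": 0.05},
}

# Multiply the probabilities on the branches leading to each outcome.
joint = {
    (supplier, condition): priors[supplier] * p
    for supplier, branches in conditionals.items()
    for condition, p in branches.items()
}

for outcome, p in joint.items():
    print(outcome, round(p, 4))

# The four outcomes are the only possibilities, so their probabilities sum to 1.
total = sum(joint.values())
print(round(total, 4))
```

Checking that the joint probabilities sum to 1 is a quick way to catch an error in the branch probabilities.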


FIGURE 4.11 Probability Tree for Two-Supplier Example 

Step 1         Step 2             Probability of Outcome 
Supplier       Condition 
P(A₁) = .65    P(G | A₁) = .98    P(A₁ ∩ G) = P(A₁)P(G | A₁) = .6370 
               P(B | A₁) = .02    P(A₁ ∩ B) = P(A₁)P(B | A₁) = .0130 
P(A₂) = .35    P(G | A₂) = .95    P(A₂ ∩ G) = P(A₂)P(G | A₂) = .3325 
               P(B | A₂) = .05    P(A₂ ∩ B) = P(A₂)P(B | A₂) = .0175 

The Reverend Thomas Bayes (1702-1761), a Presbyterian minister, is credited with the 
original work leading to the version of Bayes’ theorem in use today. 

Suppose now that the parts from the two suppliers are used in the firm’s manufacturing 
process and that a machine breaks down because it attempts to process a bad part. Given 
the information that the part is bad, what is the probability that it came from supplier 1 and 
what is the probability that it came from supplier 2? With the information in the probability 
tree (Figure 4.11), Bayes’ theorem can be used to answer these questions. 

Letting B denote the event that the part is bad, we are looking for the posterior probabilities 
P(A₁ | B) and P(A₂ | B). From the law of conditional probability, we know that 


$$P(A_1 \mid B) = \frac{P(A_1 \cap B)}{P(B)} \tag{4.14}$$

Referring to the probability tree, we see that 

$$P(A_1 \cap B) = P(A_1)P(B \mid A_1) \tag{4.15}$$

To find P(B), we note that event B can occur in only two ways: (A₁ ∩ B) and (A₂ ∩ B). 
Therefore, we have 

$$\begin{aligned} P(B) &= P(A_1 \cap B) + P(A_2 \cap B) \\ &= P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) \end{aligned} \tag{4.16}$$

Substituting from equations (4.15) and (4.16) into equation (4.14) and writing a similar 
result for P(A₂ | B), we obtain Bayes’ theorem for the case of two events. 


BAYES’ THEOREM (TWO-EVENT CASE) 

$$P(A_1 \mid B) = \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \tag{4.17}$$

$$P(A_2 \mid B) = \frac{P(A_2)P(B \mid A_2)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \tag{4.18}$$


Using equation (4.17) and the probability values provided in the example, we have 


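$$\begin{aligned} P(A_1 \mid B) &= \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \\ &= \frac{(.65)(.02)}{(.65)(.02) + (.35)(.05)} = \frac{.0130}{.0130 + .0175} \\ &= \frac{.0130}{.0305} = .4262 \end{aligned}$$

In addition, using equation (4.18), we find P(A₂ | B). 

$$\begin{aligned} P(A_2 \mid B) &= \frac{(.35)(.05)}{(.65)(.02) + (.35)(.05)} \\ &= \frac{.0175}{.0130 + .0175} = \frac{.0175}{.0305} = .5738 \end{aligned}$$
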
Note that in this application we started with a probability of .65 that a part selected at 
random was from supplier 1. However, given information that the part is bad, the probability 
that the part is from supplier 1 drops to .4262. In fact, if the part is bad, there is more than a 
50-50 chance that it came from supplier 2; that is, P(A₂ | B) = .5738. 
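
The two-event calculation can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example; the function name `bayes_two_event` and its argument names are illustrative and do not come from the text.

```python
def bayes_two_event(p_a1: float, p_b_given_a1: float,
                    p_a2: float, p_b_given_a2: float) -> tuple[float, float]:
    """Return (P(A1 | B), P(A2 | B)) using equations (4.17) and (4.18)."""
    joint_1 = p_a1 * p_b_given_a1  # P(A1 ∩ B), equation (4.15)
    joint_2 = p_a2 * p_b_given_a2  # P(A2 ∩ B)
    p_b = joint_1 + joint_2        # P(B), equation (4.16)
    return joint_1 / p_b, joint_2 / p_b

# Two-supplier example: priors .65 and .35, bad-part rates .02 and .05.
post_1, post_2 = bayes_two_event(0.65, 0.02, 0.35, 0.05)
print(round(post_1, 4), round(post_2, 4))
```

Because A₁ and A₂ are mutually exclusive and collectively exhaustive, the two posterior probabilities returned by the function sum to 1.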

Bayes’ theorem is applicable when the events for which we want to compute posterior 
probabilities are mutually exclusive and their union is the entire sample space.¹ For the 
case of n mutually exclusive events A₁, A₂, . . . , Aₙ, whose union is the entire sample 
space, Bayes’ theorem can be used to compute any posterior probability P(Aᵢ | B), as 
shown here. 


BAYES’ THEOREM 

$$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_n)P(B \mid A_n)} \tag{4.19}$$


With prior probabilities P(A₁), P(A₂), . . . , P(Aₙ) and the appropriate conditional 
probabilities P(B | A₁), P(B | A₂), . . . , P(B | Aₙ), equation (4.19) can be used to compute 
the posterior probability of the events A₁, A₂, . . . , Aₙ. 


Tabular Approach 


A tabular approach is helpful in conducting the Bayes’ theorem calculations. Such an 
approach is shown in Table 4.7 for the parts supplier problem. The computations shown 
there are done in the following steps. 


Step 1. Prepare the following three columns: 

Column 1—The mutually exclusive events Aᵢ for which posterior probabilities 
are desired 

Column 2—The prior probabilities P(Aᵢ) for the events 

Column 3—The conditional probabilities P(B | Aᵢ) of the new information B 
given each event 

Step 2. In column 4, compute the joint probabilities P(Aᵢ ∩ B) for each event and the new 
information B by using the multiplication law. These joint probabilities are found 
by multiplying the prior probabilities in column 2 by the corresponding conditional 
probabilities in column 3; that is, P(Aᵢ ∩ B) = P(Aᵢ)P(B | Aᵢ). 

Step 3. Sum the joint probabilities in column 4. The sum is the probability of the new 
information, P(B). Thus we see in Table 4.7 that there is a .0130 probability 
that the part came from supplier 1 and is bad and a .0175 probability that the 
part came from supplier 2 and is bad. Because these are the only two ways in 
which a bad part can be obtained, the sum .0130 + .0175 shows an overall 
probability of .0305 of finding a bad part from the combined shipments of the 
two suppliers. 


TABLE 4.7 Tabular Approach to Bayes’ Theorem Calculations for the Two-Supplier Problem 

(1)       (2)              (3)              (4)              (5) 
          Prior            Conditional      Joint            Posterior 
Events    Probabilities    Probabilities    Probabilities    Probabilities 
Aᵢ        P(Aᵢ)            P(B | Aᵢ)        P(Aᵢ ∩ B)        P(Aᵢ | B) 
A₁        .65              .02              .0130            .0130/.0305 = .4262 
A₂        .35              .05              .0175            .0175/.0305 = .5738 
          1.00                              P(B) = .0305     1.0000 


¹If the union of events is the entire sample space, the events are said to be collectively exhaustive. 

Step 4. In column 5, compute the posterior probabilities using the basic relationship of 
conditional probability. 

$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)}$$

Note that the joint probabilities P(Aᵢ ∩ B) are in column 4 and the probability 
P(B) is the sum of column 4. 
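
The four steps of the tabular approach translate directly into code. The following Python sketch handles any number of mutually exclusive, collectively exhaustive events; the function name `bayes_table` and the row layout are illustrative assumptions, and the call at the end uses the two-supplier values.

```python
def bayes_table(priors: list[float], conditionals: list[float]):
    """Build columns 2 through 5 of the tabular approach and return (rows, P(B))."""
    if len(priors) != len(conditionals):
        raise ValueError("each prior needs exactly one conditional probability")
    # Step 2: column 4 holds the joint probabilities P(A_i ∩ B) = P(A_i) P(B | A_i).
    joints = [p * c for p, c in zip(priors, conditionals)]
    # Step 3: the sum of column 4 is P(B).
    p_b = sum(joints)
    if p_b == 0:
        raise ValueError("P(B) is zero, so posterior probabilities are undefined")
    # Step 4: column 5 holds the posterior probabilities P(A_i | B) = P(A_i ∩ B) / P(B).
    posteriors = [j / p_b for j in joints]
    rows = list(zip(priors, conditionals, joints, posteriors))
    return rows, p_b

rows, p_b = bayes_table([0.65, 0.35], [0.02, 0.05])
for i, (prior, cond, joint, post) in enumerate(rows, start=1):
    print(f"A{i}  {prior:.2f}  {cond:.2f}  {joint:.4f}  {post:.4f}")
print(f"P(B) = {p_b:.4f}")
```

The function does not verify that the priors sum to 1; that check is left to the caller in this sketch.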


NOTES + COMMENTS 


1. Bayes’ theorem is used extensively in decision analysis. The prior probabilities are often 
subjective estimates provided by a decision maker. Sample information is obtained and 
posterior probabilities are computed for use in choosing the best decision. 

2. An event and its complement are mutually exclusive, and their union is the entire sample 
space. Thus, Bayes’ theorem is always applicable for computing posterior probabilities of 
an event and its complement. 


EXERCISES 


Methods 
39. The prior probabilities for events A₁ and A₂ are P(A₁) = .40 and P(A₂) = .60. It is also 
known that P(A₁ ∩ A₂) = 0. Suppose P(B | A₁) = .20 and P(B | A₂) = .05. 
a. Are A₁ and A₂ mutually exclusive? Explain. 
b. Compute P(A₁ ∩ B) and P(A₂ ∩ B). 
c. Compute P(B). 
d. Apply Bayes’ theorem to compute P(A₁ | B) and P(A₂ | B). 
40. The prior probabilities for events A₁, A₂, and A₃ are P(A₁) = .20, P(A₂) = .50, and 
P(A₃) = .30, respectively. The conditional probabilities of event B given A₁, A₂, and 
A₃ are P(B | A₁) = .50, P(B | A₂) = .40, and P(B | A₃) = .30. 
a. Compute P(B ∩ A₁), P(B ∩ A₂), and P(B ∩ A₃). 
b. Apply Bayes’ theorem, equation (4.19), to compute the posterior probability 
P(A₂ | B). 
c. Use the tabular approach to applying Bayes’ theorem to compute P(A₁ | B), 
P(A₂ | B), and P(A₃ | B). 


Applications 

41. Consulting Firm Bids. A consulting firm submitted a bid for a large research project. 
The firm’s management initially felt they had a 50-50 chance of getting the project. 
However, the agency to which the bid was submitted subsequently requested addi- 
tional information on the bid. Past experience indicates that for 75% of the successful 
bids and 40% of the unsuccessful bids the agency requested additional information. 

a. What is the prior probability of the bid being successful (that is, prior to the request 
for additional information)? 

b. What is the conditional probability of a request for additional information given 
that the bid will ultimately be successful? 

c. Compute the posterior probability that the bid will be successful given a request for 
additional information. 

42. Credit Card Defaults. A local bank reviewed its credit card policy with the intention of 
recalling some of its credit cards. In the past approximately 5% of cardholders defaul- 
ted, leaving the bank unable to collect the outstanding balance. Hence, management 
established a prior probability of .05 that any particular cardholder will default. The 
bank also found that the probability of missing a monthly payment is .20 for customers 
who do not default. Of course, the probability of missing a monthly payment for those 

who default is 1. 

a. Given that a customer missed a monthly payment, compute the posterior probabil- 
ity that the customer will default. 

b. The bank would like to recall its credit card if the probability that a customer will 
default is greater than .20. Should the bank recall its credit card if the customer 
misses a monthly payment? Why or why not? 

43. Prostate Cancer Screening. According to a 2018 article in Esquire magazine, ap- 
proximately 70% of males over age 70 will develop cancerous cells in their prostate. 
Prostate cancer is second only to skin cancer as the most common form of cancer for 
males in the United States. One of the most common tests for the detection of prostate 
cancer is the prostate-specific antigen (PSA) test. However, this test is known to have 
a high false-positive rate (tests that come back positive for cancer when no cancer 
is present). Suppose there is a .02 probability that a male patient has prostate cancer 
before testing. The probability of a false-positive test is .75, and the probability of a 
false-negative (no indication of cancer when cancer is actually present) is .20. 

a. What is the probability that the male patient has prostate cancer if the PSA test 
comes back positive? 

b. What is the probability that the male patient has prostate cancer if the PSA test 
comes back negative? 

c. For older men, the prior probability of having cancer increases. Suppose that the prior 
probability of the male patient is .3 rather than .02. What is the probability that the male 
patient has prostate cancer if the PSA test comes back positive? What is the probability 
that the male patient has prostate cancer if the PSA test comes back negative? 

d. What can you infer about the PSA test from the results of parts (a), (b), and (c)? 

44. Golf Equipment Website Visitors. ParFore created a website to market golf equip- 
ment and golf apparel. Management would like a special pop-up offer to appear for 
female website visitors and a different special pop-up offer to appear for male website 
visitors. From a sample of past website visitors, ParFore’s management learned that 
60% of the visitors are male and 40% are female. 

a. What is the probability that a current visitor to the website is female? 

b. Suppose 30% of ParFore’s female visitors previously visited the Dillard’s De- 
partment Store website and 10% of ParFore’s male visitors previously visited the 
Dillard’s Department Store website. If the current visitor to ParFore’s website pre- 
viously visited the Dillard’s website, what is the revised probability that the current 
visitor is female? Should the ParFore’s website display the special offer that appeals 
to female visitors or the special offer that appeals to male visitors? 

45. Americans Without Health Insurance. The National Center for Health Statistics, 
housed within the U.S. Centers for Disease Control and Prevention (CDC), tracks the 
number of adults in the United States who have health insurance. According to this 
agency, the uninsured rates for Americans in 2018 are as follows: 5.1% of those under 
the age of 18, 12.4% of those ages 18–64, and 1.1% of those 65 and older do not have 
health insurance (CDC website). Approximately 22.8% of Americans are under age 
18, and 61.4% of Americans are ages 18–64. 

a. What is the probability that a randomly selected person in the United States is 
65 or older? 

b. Given that the person is an uninsured American, what is the probability that the 
person is 65 or older? 


Summary 


In this chapter we introduced basic probability concepts and illustrated how probability 
analysis can be used to provide helpful information for decision making. We described 
how probability can be interpreted as a numerical measure of the likelihood that an event 
will occur. In addition, we saw that the probability of an event can be computed either by 
summing the probabilities of the experimental outcomes (sample points) comprising the 
event or by using the relationships established by the addition, conditional probability, and 
multiplication laws of probability. For cases in which additional information is available, 
we showed how Bayes’ theorem can be used to obtain revised or posterior probabilities. 


Glossary 


Addition law A probability law used to compute the probability of the union of two 
events. It is P(A ∪ B) = P(A) + P(B) − P(A ∩ B). For mutually exclusive events, 
P(A ∩ B) = 0; in this case the addition law reduces to P(A ∪ B) = P(A) + P(B). 

Basic requirements for assigning probabilities Two requirements that restrict the 
manner in which probability assignments can be made: (1) For each experimental outcome 
Eᵢ we must have 0 ≤ P(Eᵢ) ≤ 1; (2) considering all experimental outcomes, we must have 
P(E₁) + P(E₂) + · · · + P(Eₙ) = 1.0. 

Bayes’ theorem A method used to compute posterior probabilities. 

Classical method A method of assigning probabilities that is appropriate when all the 
experimental outcomes are equally likely. 

Combination In an experiment we may be interested in determining the number of ways 
n objects may be selected from among N objects without regard to the order in which the 
n objects are selected. Each selection of n objects is called a combination and the total 
number of combinations of N objects taken n at a time is 

$$C_n^N = \binom{N}{n} = \frac{N!}{n!(N-n)!}$$

for n = 0, 1, 2, . . . , N. 

Complement of A The event consisting of all sample points that are not in A. 
Conditional probability The probability of an event given that another event already 
occurred. The conditional probability of A given B is P(A | B) = P(A ∩ B)/P(B). 

Event A collection of sample points. 

Experiment A process that generates well-defined outcomes. 

Independent events Two events A and B where P(A | B) = P(A) or P(B | A) = P(B); that 
is, the events have no influence on each other. 

Intersection of A and B The event containing the sample points belonging to both A and 
B. The intersection is denoted A ∩ B. 

Joint probability The probability of two events both occurring; that is, the probability of 
the intersection of two events. 

Marginal probability The values in the margins of a joint probability table that provide 
the probabilities of each event separately. 

Multiple-step experiment An experiment that can be described as a sequence of steps. 
If a multiple-step experiment has k steps with n₁ possible outcomes on the first step, n₂ 
possible outcomes on the second step, and so on, the total number of experimental outcomes is 
given by (n₁)(n₂) . . . (nₖ). 

Multiplication law A probability law used to compute the probability of the intersection 
of two events. It is P(A ∩ B) = P(B)P(A | B) or P(A ∩ B) = P(A)P(B | A). For independent 
events it reduces to P(A ∩ B) = P(A)P(B). 

Mutually exclusive events Events that have no sample points in common; that is, A ∩ B 
is empty and P(A ∩ B) = 0. 

Permutation In an experiment we may be interested in determining the number of ways 
n objects may be selected from among N objects when the order in which the n objects 
are selected is important. Each ordering of n objects is called a permutation and the total 
number of permutations of N objects taken n at a time is 

$$P_n^N = n!\binom{N}{n} = \frac{N!}{(N-n)!}$$

for n = 0, 1, 2, . . . , N. 

Posterior probabilities Revised probabilities of events based on additional information. 
Prior probabilities Initial estimates of the probabilities of events. 

Probability A numerical measure of the likelihood that an event will occur. 

Relative frequency method A method of assigning probabilities that is appropriate when 
data are available to estimate the proportion of the time the experimental outcome will 
occur if the experiment is repeated a large number of times. 

Sample point An element of the sample space. A sample point represents an experimental 
outcome. 

Sample space The set of all experimental outcomes. 

Subjective method A method of assigning probabilities on the basis of judgment. 

Tree diagram A graphical representation that helps in visualizing a multiple-step 
experiment. 

Union of A and B The event containing all sample points belonging to A or B or both. The 
union is denoted A U B. 

Venn diagram A graphical representation for showing symbolically the sample space and 
operations involving events in which the sample space is represented by a rectangle and 
events are represented as circles within the sample space. 


Key Formulas 


Counting Rule for Combinations

$C_n^N = \binom{N}{n} = \frac{N!}{n!(N-n)!}$ (4.1)

Counting Rule for Permutations

$P_n^N = n!\binom{N}{n} = \frac{N!}{(N-n)!}$ (4.2)

Computing Probability Using the Complement

$P(A) = 1 - P(A^c)$ (4.5)

Addition Law

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (4.6)

Conditional Probability

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (4.7)

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$ (4.8)

Multiplication Law

$P(A \cap B) = P(B)P(A \mid B)$ (4.11)

$P(A \cap B) = P(A)P(B \mid A)$ (4.12)

Multiplication Law for Independent Events

$P(A \cap B) = P(A)P(B)$ (4.13)

Bayes’ Theorem

$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_n)P(B \mid A_n)}$ (4.19)

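The counting rules and Bayes' theorem listed above can be checked numerically. The following is a minimal Python sketch, not part of the text's Excel workflow; the helper name `bayes_posteriors` and all input values are hypothetical and chosen only for illustration.

```python
from math import comb, perm


def bayes_posteriors(priors, likelihoods):
    """Return P(A_i | B) for each i, given priors P(A_i) and likelihoods P(B | A_i)."""
    # Numerator terms of Bayes' theorem: P(A_i) * P(B | A_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the sum of all numerator terms, which equals P(B)
    total = sum(joint)
    return [j / total for j in joint]


# Counting rules with N = 5 objects taken n = 2 at a time (illustrative values)
n_combinations = comb(5, 2)  # N! / (n!(N - n)!)
n_permutations = perm(5, 2)  # N! / (N - n)!

# Bayes' theorem with two hypothetical, mutually exclusive events A1 and A2
posteriors = bayes_posteriors(priors=[0.6, 0.4], likelihoods=[0.1, 0.3])

print(n_combinations, n_permutations, posteriors)
```

Because the posterior probabilities share the same denominator P(B), they always sum to 1, which is a quick check on any hand calculation.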


conducted by Princess Cruises asked how many days into a vacation it takes until 
respondents feel truly relaxed. The responses were as follows. 



Days Until Relaxed    Frequency
≤1                    422
2                     181
3                     80
≥4, <∞                121
Never (∞)             201


a. How many adults participated in the Princess Cruises survey? 
b. What response has the highest probability? What is the probability of this response? 
c. What is the probability a respondent never feels truly relaxed on a vacation? 
d. What is the probability it takes a respondent 2 or more days to feel truly relaxed? 
47. Financial Manager Investments. A financial manager made two new investments— 
one in the oil industry and one in municipal bonds. After a one-year period, each of 
the investments will be classified as either successful or unsuccessful. Consider the 
making of the two investments as an experiment. 
a. How many sample points exist for this experiment? 
b. Show a tree diagram and list the sample points. 
c. Let O = the event that the oil industry investment is successful and M = the event that 
the municipal bond investment is successful. List the sample points in O and in M. 
d. List the sample points in the union of the events (O ∪ M).
e. List the sample points in the intersection of the events (O ∩ M).
f. Are events O and M mutually exclusive? Explain. 
48. Opinions About Television Programs. Below are the results of a survey of 1364 
individuals who were asked if they use social media and other websites to voice their 
opinions about television programs. 


          Uses Social Media and Other    Doesn't Use Social Media and
          Websites to Voice Opinions     Other Websites to Voice Opinions
          About Television Programs      About Television Programs
Female    395                            291
Male      323                            355


a. Show a joint probability table. 

b. What is the probability a respondent is female? 

c. What is the conditional probability a respondent uses social media and other web- 
sites to voice opinions about television programs given the respondent is female? 

d. Let F denote the event that the respondent is female and A denote the event that the 
respondent uses social media and other websites to voice opinions about television 
programs. Are events F and A independent? 

49. Treatment-Caused Injuries. A study of 31,000 hospital admissions in New York 
State found that 4% of the admissions led to treatment-caused injuries. One-seventh 
of these treatment-caused injuries resulted in death, and one-fourth were caused by 
negligence. Malpractice claims were filed in one out of 7.5 cases involving negligence, 
and payments were made in one out of every two claims. 

a. What is the probability a person admitted to the hospital will suffer a treatment- 
caused injury due to negligence? 

b. What is the probability a person admitted to the hospital will die from a treatment- 
caused injury? 

c. What is the probability a person admitted to the hospital will result in a malpractice 
claim that must be paid due to a negligent treatment caused injury? 
50. Viewer Response to New Television Show. A survey to determine viewer response to
a new television show obtained the following data. 


Rating Frequency 
Poor 4 
Below average 8 
Average 1 
Above average 14 
Excellent 13 


a. What is the probability that a randomly selected viewer will rate the new show as 
average or better? 

b. What is the probability that a randomly selected viewer will rate the new show 
below average or worse? 

51. Highest Level of Education and Household Income. The U.S. Census Bureau serves
as the leading source of quantitative data about the nation’s people and economy. The
following crosstabulation shows the number of households (in 1,000s) and the household
income by the highest level of education for the head of household (U.S. Census Bureau
website). Only households in which the head has a high school diploma or more are included.


                                        Household Income

Highest Level           Under      $25,000 to   $50,000 to   $100,000
of Education            $25,000    $49,999      $99,999      and Over    Total

High school graduate                                                     32,773
Bachelor's degree                                                        22,131
Master's degree                                                           9,003
Doctoral degree                                                           1,737
Total                   13,128     15,499       20,548       16,469      65,644


a. Develop a joint probability table. 

b. What is the probability of the head of one of these households having a master’s 
degree or more education? 

c. What is the probability of a household headed by someone with a high school dip- 
loma earning $100,000 or more? 

d. What is the probability of one of these households having an income below $25,000? 

e. What is the probability of a household headed by someone with a bachelor’s degree 
earning less than $25,000? 

f. Is household income independent of educational level? 

52. MBA New-Matriculants Survey. An MBA new-matriculants survey provided the 
following data for 2018 students. 


Applied to More 
Than One School 


Yes No 
23 and under 
24-26 
Age 27-30 
Group 31-35 
36 and over 



a. For a randomly selected MBA student, prepare a joint probability table for the 
experiment consisting of observing the student’s age and whether the student ap- 
plied to one or more schools. 

b. What is the probability that a randomly selected applicant is 23 or under? 

c. What is the probability that a randomly selected applicant is older than 26? 

d. What is the probability that a randomly selected applicant applied to more than one 
school? 

53. MBA New-Matriculants Survey. Refer again to the data from the MBA new matricu- 
lants survey in exercise 52. 

a. Given that a person applied to more than one school, what is the probability that the 
person is 24-26 years old?

b. Given that a person is in the 36-and-over age group, what is the probability that the 
person applied to more than one school? 

c. What is the probability that a person is 24-26 years old or applied to more than one
school? 

d. Suppose a person is known to have applied to only one school. What is the proba- 
bility that the person is 31 or more years old? 

e. Is the number of schools applied to independent of age? Explain. 

54. Internet Sites Collecting User Information. The Pew Internet & American Life
project conducted a survey that included several questions about how Internet users
feel about search engines and other websites collecting information about them and
using this information either to shape search results or target advertising to them.
In one question, participants were asked, “If a search engine kept track of what you
search for, and then used that information to personalize your future search results,
how would you feel about that?” Respondents could indicate either “Would not be
okay with it because you feel it is an invasion of your privacy” or “Would be okay with
it, even if it means they are gathering information about you.” Joint probabilities of
responses and age groups are summarized in the following table.


Age      Not Okay    Okay
18-29    .1485       .0604
30-49    .2273       .0907
50+      .4008       .0723


a. What is the probability a respondent will say she or he is not okay with this practice? 

b. Given a respondent is 30-49 years old, what is the probability the respondent will
say she or he is okay with this practice? 

c. Given a respondent says she or he is not okay with this practice, what is the probab- 
ility the respondent is 50+ years old? 

d. Is the attitude about this practice independent of the age of the respondent? Why or 
why not? 

e. Do attitudes toward this practice differ for respondents who are 18-29 years old 
and respondents who are 50+ years old? 

55. Advertisements and Product Purchases. A large consumer goods company ran a 
television advertisement for one of its soap products. On the basis of a survey that was 
conducted, probabilities were assigned to the following events. 


B = individual purchased the product
S = individual recalls seeing the advertisement
B ∩ S = individual purchased the product and recalls seeing the advertisement

The probabilities assigned were P(B) = .20, P(S) = .40, and P(B ∩ S) = .12.


a. What is the probability of an individual’s purchasing the product given that the indi- 
vidual recalls seeing the advertisement? Does seeing the advertisement increase the 
probability that the individual will purchase the product? As a decision maker, would
you recommend continuing the advertisement (assuming that the cost is reasonable)? 
b. Assume that individuals who do not purchase the company’s soap product buy 
from its competitors. What would be your estimate of the company’s market share? 
Would you expect that continuing the advertisement will increase the company’s 
market share? Why or why not? 
c. The company also tested another advertisement and assigned it values of
P(S) = .30 and P(B ∩ S) = .10. What is P(B | S) for this other advertisement?
Which advertisement seems to have had the bigger effect on customer purchases? 
56. Days Listed Until Sold. Cooper Realty is a small real estate company located in 
Albany, New York, specializing primarily in residential listings. They recently 
became interested in determining the likelihood of one of their listings being sold 
within a certain number of days. An analysis of company sales of 800 homes in 
previous years produced the following data. 


                                        Days Listed Until Sold
                                        Under 30    31-90    Over 90    Total

                  Under $150,000                                        100
Initial Asking    $150,000-$199,999                                     250
Price             $200,000-$250,000                                     400
                  Over $250,000                                          50

                  Total                 100         500      200        800


a. If A is defined as the event that a home is listed for more than 90 days before being 
sold, estimate the probability of A. 

b. If B is defined as the event that the initial asking price is under $150,000, estimate 
the probability of B. 

c. What is the probability of A ∩ B?

d. Assuming that a contract was just signed to list a home with an initial asking price 
of less than $150,000, what is the probability that the home will take Cooper Realty 
more than 90 days to sell? 

e. Are events A and B independent? 

57. Lost-Time Accidents. A company studied the number of lost-time accidents 
occurring at its Brownsville, Texas, plant. Historical records show that 6% of 
the employees suffered lost-time accidents last year. Management believes that a 
special safety program will reduce such accidents to 5% during the current year. In 
addition, it estimates that 15% of employees who had lost-time accidents last year 
will experience a lost-time accident during the current year. 

a. What percentage of the employees will experience lost-time accidents in both years? 

b. What percentage of the employees will suffer at least one lost-time accident over 
the two-year period? 

58. Students Studying Abroad. Many undergraduate students in the U.S. study 
abroad as part of their education. Assume that 60% of the undergraduate students 
who study abroad are female and that 49% of the undergraduate students who do 
not study abroad are female. 

a. Given a female undergraduate student, what is the probability that she studies abroad? 

b. Given a male undergraduate student, what is the probability that he studies abroad? 

c. What is the overall percentage of full-time female undergraduate students? What is 
the overall percentage of full-time male undergraduate students? 

59. Finding Oil in Alaska. An oil company purchased an option on land in Alaska. Pre- 
liminary geologic studies assigned the following prior probabilities. 
P(high-quality oil) = .50
P(medium-quality oil) = .20 
P(no oil) = .30 


a. What is the probability of finding oil? 
b. After 200 feet of drilling on the first well, a soil test is taken. The probabilities of 
finding the particular type of soil identified by the test follow. 


P(soil | high-quality oil) = .20 
P(soil | medium-quality oil) = .80 
P(soil | no oil) = .20 


How should the firm interpret the soil test? What are the revised probabilities, and 
what is the new probability of finding oil? 

60. Spam E-mail Filters. A study by Forbes indicated that the five most common words 
appearing in spam e-mails are shipping!, today!, here!, available, and fingertips! Many 
spam filters separate spam from ham (e-mail not considered to be spam) through ap- 
plication of Bayes’ theorem. Suppose that for one e-mail account, 1 in every 10 mes- 
sages is spam and the proportions of spam messages that have the five most common 
words in spam e-mail are given below. 


shipping!      .051
today!         .045
here!          .034
available      .014
fingertips!    .014

Also suppose that the proportions of ham messages that have these words are

shipping!      .0015
today!         .0022
here!          .0022
available      .0041
fingertips!    .0011


a. If a message includes the word shipping!, what is the probability the message is 
spam? If a message includes the word shipping!, what is the probability the mes- 
sage is ham? Should messages that include the word shipping! be flagged as spam? 

b. If a message includes the word today!, what is the probability the message is spam? 
If a message includes the word here!, what is the probability the message is spam? 
Which of these two words is a stronger indicator that a message is spam? Why? 

c. If a message includes the word available, what is the probability the message is spam?
If a message includes the word fingertips!, what is the probability the message is 
spam? Which of these two words is a stronger indicator that a message is spam? Why? 

d. What insights do the results of parts (b) and (c) yield about what enables a spam 
filter that uses Bayes’ theorem to work effectively? 


CASE PROBLEM 1: HAMILTON COUNTY JUDGES


Hamilton County judges try thousands of cases per year. In an overwhelming majority of 
the cases disposed, the verdict stands as rendered. However, some cases are appealed, and 
of those appealed, some of the cases are reversed. Kristen DelGuzzi of The Cincinnati 
Enquirer conducted a study of cases handled by Hamilton County judges over a three-year 
period. Shown in Table 4.8 are the results for 182,908 cases handled (disposed) by 38 judges 
in Common Pleas Court, Domestic Relations Court, and Municipal Court. Two of the judges 
(Dinkelacker and Hogan) did not serve in the same court for the entire three-year period. 
The purpose of the newspaper’s study was to evaluate the performance of the 
judges. Appeals are often the result of mistakes made by judges, and the newspaper 
wanted to know which judges were doing a good job and which were making too many 
TABLE 4.8 Total Cases Disposed, Appealed, and Reversed in Hamilton County Courts

DATA file: Judge

Common Pleas Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Fred Cartolano           3,037                   137               12
Thomas Crush             3,372                   119               10
Patrick Dinkelacker      1,258                   44                8
Timothy Hogan            1,954                   60                7
Robert Kraft             3,138                   127               7
William Mathews          2,264                   91                18
William Morrissey        3,032                   121               22
Norbert Nadel            2,959                   131               20
Arthur Ney, Jr.          3,219                   125               14
Richard Niehaus          3,353                   137               16
Thomas Nurre             3,000                   121               6
John O'Connor            2,969                   129               12
Robert Ruehlman          3,205                   145               18
J. Howard Sundermann     955                     60                10
Ann Marie Tracey         3,141                   127               13
Ralph Winkler            3,089                   88                6
Total                    43,945                  1,762             199

Domestic Relations Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Penelope Cunningham      2,729                   7                 1
Patrick Dinkelacker      6,001                   19                4
Deborah Gaines           8,799                   48                9
Ronald Panioto           12,970                  32                3
Total                    30,499                  106               17

Municipal Court

Judge                    Total Cases Disposed    Appealed Cases    Reversed Cases
Mike Allen               6,149                   43                4
Nadine Allen             7,812                   34                6
Timothy Black            7,954                   41                6
David Davis              7,736                   43                5
Leslie Isaiah Gaines     5,282                   35                13
Karla Grady              5,253                   6                 0
Deidra Hair              2,532                   5                 0
Dennis Helmick           7,900                   29                5
Timothy Hogan            2,308                   13                2
James Patrick Kenney     2,798                   6                 1
Joseph Luebbers          4,698                   25                8
William Mallory          8,277                   38                9
Melba Marsh              8,219                   34                7
Beth Mattingly           2,971                   13                1
Albert Mestemaker        4,975                   28                9
Mark Painter             2,239                   7                 3
Jack Rosen               7,790                   41                13
Mark Schweikert          5,403                   33                6
David Stockdale          5,371                   22                4
John A. West             2,797                   4                 2
Total                    108,464                 500               104

mistakes. You are called in to assist in the data analysis. Use your knowledge of
probability and conditional probability to help with the ranking of the judges. You also may
be able to analyze the likelihood of appeal and reversal for cases handled by different
courts.


Managerial Report 

Prepare a report with your rankings of the judges. Also, include an analysis of the like- 
lihood of appeal and case reversal in the three courts. At a minimum, your report should 
include the following: 


1. The probability of cases being appealed and reversed in the three different courts.
2. The probability of a case being appealed for each judge.
3. The probability of a case being reversed for each judge.
4. The probability of reversal given an appeal for each judge.
5. Rank the judges within each court. State the criteria you used and provide a rationale
for your choice.
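As a starting point for the report, the three probabilities can be computed from counts of disposed, appealed, and reversed cases. The following is a minimal Python sketch using only the Common Pleas Court totals from Table 4.8; it assumes that every reversed case was first appealed, so the count of reversed cases is also the count of cases that were both appealed and reversed. The variable names are illustrative, and the same arithmetic applies to each judge's row.

```python
# Court-level totals from the Common Pleas Court "Total" row of Table 4.8
disposed = 43_945
appealed = 1_762
reversed_cases = 199

# P(A): probability that a disposed case is appealed
p_appeal = appealed / disposed

# P(A ∩ R): probability that a disposed case is appealed and reversed
# (assumption: a case can be reversed only if it was appealed)
p_reversal = reversed_cases / disposed

# P(R | A) = P(A ∩ R) / P(A): probability of reversal given an appeal
p_reversal_given_appeal = p_reversal / p_appeal

print(f"P(appeal)            = {p_appeal:.4f}")
print(f"P(reversal)          = {p_reversal:.4f}")
print(f"P(reversal | appeal) = {p_reversal_given_appeal:.4f}")
```

Note that the conditional probability reduces to reversed cases divided by appealed cases, since the number of disposed cases cancels.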


CASE PROBLEM 2: ROB'S MARKET 


Rob’s Market (RM) is a regional food store chain in the southwest United States. David 
Crawford, director of Business Intelligence for RM, would like to initiate a study of the 
purchase behavior of customers who use the RM loyalty card (a card that customers scan 
at checkout to qualify for discounted prices). The use of the loyalty card allows RM to cap- 
ture what is known as “point-of-sale” data, that is, a list of products purchased by a given 
customer as he/she checks out of the market. David feels that better understanding of which 
products tend to be purchased together could lead to insights for better pricing and display 
strategies as well as a better understanding of sales and the potential impact of different 
levels of coupon discounts. This type of analysis is known as market basket analysis, as it 
is a study of what different customers have in their shopping baskets as they check out of 
the store. 

As a prototype study, David wants to investigate customer buying behavior with regard 
to bread, jelly, and peanut butter. RM’s Information Technology (IT) group, at David’s 
request, has provided a data set of purchases by 1,000 customers over a one-week period. 
The data set contains the following variables for each customer: 


• Bread: wheat, white, or none
• Jelly: grape, strawberry, or none
• Peanut butter: creamy, natural, or none

DATA file: MarketBasket


The variables appear in the above order from left to right in the data set, where each row is 
a customer. For example, the first record of the data set is 


white grape none 


which means that customer 1 purchased white bread, grape jelly, and no peanut butter. The
second record is 


white strawberry none 


which means that customer 2 purchased white bread, strawberry jelly, and no peanut 
butter. The sixth record in the data set is 


none none none 


which means that the sixth customer did not purchase bread, jelly, or peanut butter. 

Other records are interpreted in a similar fashion. 

David would like you to do an initial study of the data to get a better understanding of 
RM customer behavior with regard to these three products. 
Managerial Report

Prepare a report that gives insight into the purchase behavior of customers who use the RM 
loyalty card. At a minimum your report should include estimates of the following: 


1. The probability that a random customer does not purchase any of the three products 
(bread, jelly, or peanut butter). 

2. The probability that a random customer purchases white bread. 

3. The probability that a random customer purchases wheat bread. 

4. The probability that a random customer purchases grape jelly given that he/she 
purchases white bread. 

5. The probability that a random customer purchases strawberry jelly given that he/she 
purchases white bread. 

6. The probability that a random customer purchases creamy peanut butter given that he/ 
she purchases white bread. 

7. The probability that a random customer purchases natural peanut butter given that he/ 
she purchases white bread. 

8. The probability that a random customer purchases creamy peanut butter given that he/ 
she purchases wheat bread. 

9. The probability that a random customer purchases natural peanut butter given that he/ 
she purchases wheat bread. 

10. The probability that a random customer purchases white bread, grape jelly, and creamy
peanut butter.


One way to answer these questions is to use pivot tables (discussed in Chapter 2) to obtain
absolute frequencies and use the pivot table results to calculate the relevant probabilities.
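The same frequency-counting logic can also be sketched outside of Excel. The Python example below is a minimal illustration on a toy sample, not the MarketBasket data set: the first three records are the ones quoted in the case description, and the remaining three are hypothetical rows added only so the estimates are not trivial. The helper name `prob` is also an assumption of this sketch.

```python
# Each record is (bread, jelly, peanut butter), in the order used by the data set.
records = [
    ("white", "grape", "none"),         # first record quoted in the case
    ("white", "strawberry", "none"),    # second record quoted in the case
    ("none", "none", "none"),           # sixth record quoted in the case
    ("wheat", "none", "natural"),       # hypothetical
    ("white", "grape", "creamy"),       # hypothetical
    ("wheat", "strawberry", "creamy"),  # hypothetical
]


def prob(event, given=lambda r: True):
    """Estimate P(event | given) as a relative frequency over the records."""
    pool = [r for r in records if given(r)]  # restrict to the conditioning event
    return sum(1 for r in pool if event(r)) / len(pool)


# P(no bread, no jelly, no peanut butter)
p_none = prob(lambda r: r == ("none", "none", "none"))

# P(white bread)
p_white = prob(lambda r: r[0] == "white")

# P(grape jelly | white bread)
p_grape_given_white = prob(lambda r: r[1] == "grape", given=lambda r: r[0] == "white")

print(p_none, p_white, p_grape_given_white)
```

With the full data set, `records` would hold 1,000 rows, and each item in the managerial report corresponds to one call to `prob` with the appropriate event and conditioning event.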
APPENDIX


Chapter 5 Discrete Probability Distributions 


APPENDIX 5.1: DISCRETE PROBABILITY DISTRIBUTIONS WITH R 
(MINDTAP READER) 


STATISTICS IN PRACTICE 


Voter Waiting Times in Elections* 


Most people in the United States who vote during an 
election do so by arriving to a specific location known 
as a precinct polling location and casting a ballot in per- 
son. In recent elections, some voters have experienced 
extremely long waiting times to cast their ballots. This 
has been a cause of some concern because it could 
potentially disenfranchise voters who cannot wait in line 
to cast their ballots. 

Statisticians have developed models for elections that 
estimate the arrivals to precinct polling locations and wait 
times for voters. These models use mathematical equa- 
tions from the field of queueing theory to estimate wait 
times for voters. The wait time depends on many factors, 
including how many voting machines or voting booths are 
available at the polling precinct location, the length of the 
election ballot, and the arrival rate of voters. 

Data collected on voter arrivals show that voter 
arrivals follow a probability distribution known as 
the Poisson distribution. Using the properties of the 
Poisson distribution, statisticians can compute the 
probabilities for the number of voters arriving during 
any time period. For example, let x = the number of 
voters arriving to a particular precinct polling loca- 
tion during a one-minute period. Assuming that this 


*This Statistics in Practice is based on research done by Muer Yang, 
Michael J. Fry, Ted Allen, and W. David Kelton. 


location has a mean arrival rate of two voters per 
minute, the following table shows the probabilities 
for the number of voters arriving during a one-minute 
period. 


x            Probability
0            .1353
1            .2707
2            .2707
3            .1804
4            .0902
5 or more    .0527


Using these probabilities as inputs into their mod- 
els, the statisticians use queueing theory to estimate 
voter wait times at each precinct polling location. The 
statisticians can then make recommendations on how 
many voting machines or voting booths to place at 
each precinct polling location to control voter waiting 
times. 
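The probabilities in the table can be reproduced with the standard Poisson probability function, f(x) = μ^x e^(−μ) / x!, using the mean arrival rate of two voters per minute. The following Python sketch is an illustration only; the function name `poisson_pmf` is an assumption of this example and not part of the text.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """Poisson probability of exactly x arrivals when the mean number of arrivals is mu."""
    return mu**x * exp(-mu) / factorial(x)


mu = 2  # mean arrival rate: two voters per one-minute period

# Probabilities for x = 0, 1, 2, 3, 4
probs = {x: poisson_pmf(x, mu) for x in range(5)}

# "5 or more" is the complement of the first five outcomes
p_5_or_more = 1 - sum(probs.values())

for x, p in probs.items():
    print(f"{x:<10}{p:.4f}")
print(f"{'5 or more':<10}{p_5_or_more:.4f}")
```

Computing the last row as a complement avoids summing an infinite sequence of probabilities for x = 5, 6, 7, and so on.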

Discrete probability distributions, such as the Poisson 
distribution used to model voter arrivals to precinct 
polling locations, are the topic of this chapter. In addition 
to the Poisson distribution, you will learn about the bino- 
mial and the hypergeometric distributions and how they 
can be used to provide helpful probability information. 


In this chapter we extend the study of probability by introducing the concepts of random 
variables and probability distributions. Random variables and probability distributions are 
models for populations of data. The focus of this chapter is on probability distributions for 
discrete data, that is, discrete probability distributions. 

We will introduce two types of discrete probability distributions. The first type is a table
with one column for the values of the random variable and a second column for the associated
probabilities. We will see that the rules for assigning probabilities to experimental outcomes
introduced in Chapter 4 are used to assign probabilities for such a distribution. The second type of
discrete probability distribution uses a special mathematical function to compute the
probabilities for each value of the random variable. We present three probability distributions of this type
that are widely used in practice: the binomial, Poisson, and hypergeometric distributions.


5.1 Random Variables


A random variable provides a means for describing experimental outcomes using
numerical values. Random variables must assume numerical values.

The concept of an experiment and its associated experimental outcomes are
discussed in Chapter 4.


RANDOM VARIABLE


A random variable is a numerical description of the outcome of an experiment. 


In effect, a random variable associates a numerical value with each possible experimental 
outcome. The particular numerical value of the random variable depends on the outcome of 
the experiment. A random variable can be classified as being either discrete or continuous 
depending on the numerical values it assumes. 


Discrete Random Variables 


A random variable that may assume either a finite number of values or an infinite sequence 
of values such as 0, 1, 2, . . . is referred to as a discrete random variable. For example, 
consider the experiment of an accountant taking the certified public accountant (CPA) 
examination. The examination has four parts. We can define a random variable as x = the 
number of parts of the CPA examination passed. It is a discrete random variable because it 
may assume the finite number of values 0, 1, 2, 3, or 4. 

As another example of a discrete random variable, consider the experiment of cars arriving 
at a tollbooth. The random variable of interest is x = the number of cars arriving during a one- 
day period. The possible values for x come from the sequence of integers 0, 1, 2, and so on. 
Hence, x is a discrete random variable assuming one of the values in this infinite sequence. 

Although the outcomes of many experiments can naturally be described by numerical 
values, others cannot. For example, a survey question might ask an individual to recall 
the message in a recent television commercial. This experiment would have two possible 
outcomes: The individual cannot recall the message and the individual can recall the 
message. We can still describe these experimental outcomes numerically by defining 
the discrete random variable x as follows: Let x = 0 if the individual cannot recall the 
message and x = 1 if the individual can recall the message. The numerical values for this 
random variable are arbitrary (we could use 5 and 10), but they are acceptable in terms of 
the definition of a random variable—namely, x is a random variable because it provides a 
numerical description of the outcome of the experiment. 
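One way to picture a random variable is as a function that takes an experimental outcome and returns a number. The following Python sketch illustrates this with the commercial-recall example and the CPA example; the function names and the sample outcome for the CPA examination are hypothetical.

```python
def recall_indicator(outcome):
    """Random variable x for the survey: 1 if the message is recalled, 0 otherwise."""
    return 1 if outcome == "can recall" else 0


def parts_passed(results):
    """Random variable x for the CPA exam: the number of parts passed (0 through 4)."""
    return sum(1 for passed in results if passed)


# Both possible outcomes of the survey experiment, mapped to numerical values
survey_values = {o: recall_indicator(o) for o in ["cannot recall", "can recall"]}

# One hypothetical outcome of the four-part CPA examination
cpa_value = parts_passed((True, False, True, True))

print(survey_values, cpa_value)
```

Replacing the return values 0 and 1 in `recall_indicator` with 5 and 10 would still define a valid random variable, which is the sense in which the numerical coding is arbitrary.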

Table 5.1 provides some additional examples of discrete random variables. Note that in 
each example the discrete random variable assumes a finite number of values or an infinite 
sequence of values such as 0, 1, 2,.... These types of discrete random variables are dis- 
cussed in detail in this chapter. 


TABLE 5.1 Examples of Discrete Random Variables

Random Experiment                           Random Variable (x)                        Possible Values for the Random Variable
Flip a coin                                 Face of coin showing                       1 if heads; 0 if tails
Roll a die                                  Number of dots showing on top of die       1, 2, 3, 4, 5, 6
Contact five customers                      Number of customers who place an order     0, 1, 2, 3, 4, 5
Operate a health care clinic for one day    Number of patients who arrive              0, 1, 2, 3, . . .
Offer a customer the choice of              Product chosen by customer                 0 if none; 1 if choose product A;
two products                                                                           2 if choose product B
TABLE 5.2 Examples of Continuous Random Variables

Random Experiment                     Random Variable (x)                              Possible Values for the Random Variable
Customer visits a web page            Time customer spends on web page in minutes      x ≥ 0
Fill a soft drink can                 Number of ounces                                 0 ≤ x ≤ 12.1
(max capacity = 12.1 ounces)
Test a new chemical process           Temperature when the desired reaction takes      150 ≤ x ≤ 212
                                      place (min temperature = 150°F; max
                                      temperature = 212°F)
Invest $10,000 in the stock market    Value of investment after one year               x ≥ 0


Continuous Random Variables 


A random variable that may assume any numerical value in an interval or collection of
intervals is called a continuous random variable. Experimental outcomes based on
measurement scales such as time, weight, distance, and temperature can be described by
continuous random variables. For example, consider an experiment of monitoring incoming
telephone calls to the claims office of a major insurance company. Suppose the random
variable of interest is x = the time between consecutive incoming calls in minutes. This
random variable may assume any value in the interval x ≥ 0. Actually, an infinite number
of values are possible for x, including values such as 1.26 minutes, 2.751 minutes, and
4.3333 minutes. As another example, consider a 90-mile section of interstate highway I-75
north of Atlanta, Georgia. For an emergency ambulance service located in Atlanta, we
might define the random variable as x = number of miles to the location of the next traffic
accident along this section of I-75. In this case, x would be a continuous random variable
assuming any value in the interval 0 ≤ x ≤ 90. Additional examples of continuous random
variables are listed in Table 5.2. Note that each example describes a random variable that
may assume any value in an interval of values.

Continuous random variables and their probability distributions are covered in Chapter 6.


NOTES + COMMENTS 


One way to determine whether a random variable is discrete or continuous is to think of
the values of the random variable as points on a line segment. Choose two points representing
values of the random variable. If the entire line segment between the two points also
represents possible values for the random variable, then the random variable is continuous.


Methods 
1. Consider the experiment of tossing a coin twice. 

a. List the experimental outcomes. 

b. Define a random variable that represents the number of heads occurring on the two 
tosses. 

c. Show what value the random variable would assume for each of the experimental 
outcomes. 

d. Is this random variable discrete or continuous? 
2. Consider the experiment of a worker assembling a product. 
a. Define a random variable that represents the time in minutes required to assemble 
the product. 
b. What values may the random variable assume? 
c. Is the random variable discrete or continuous? 


Applications 

3. Interviews at Brookwood Institute. Three students scheduled interviews for summer 
employment at the Brookwood Institute. In each case the interview results in either 
an offer for a position or no offer. Experimental outcomes are defined in terms of the 
results of the three interviews. 

a. List the experimental outcomes. 

b. Define a random variable that represents the number of offers made. Is the random 
variable continuous? 

c. Show the value of the random variable for each of the experimental outcomes. 

4. Unemployment in Northeastern States. The Census Bureau includes nine states in 
what it defines as the Northeast region of the United States. Assume that the govern- 
ment is interested in tracking unemployment in these nine states and that the random 
variable of interest is the number of Northeastern states with an unemployment rate 
that is less than 3.5%. What values may this random variable assume? 

5. Blood Test Analysis. To perform a certain type of blood analysis, lab technicians must 
perform two procedures. The first procedure requires either one or two separate steps, 
and the second procedure requires one, two, or three steps. 

a. List the experimental outcomes associated with performing the blood analysis. 

b. If the random variable of interest is the total number of steps required to do the 
complete analysis (both procedures), show what value the random variable will 
assume for each of the experimental outcomes. 

6. Types of Random Variables. Listed below is a series of experiments and associated 
random variables. In each case, identify the values that the random variable can as- 
sume and state whether the random variable is discrete or continuous. 


   Experiment                                            Random Variable (x) 
a. Take a 20-question examination                        Number of questions answered correctly 
b. Observe cars arriving at a tollbooth for 1 hour       Number of cars arriving at tollbooth 
c. Audit 50 tax returns                                  Number of returns containing errors 
d. Observe an employee's work                            Number of nonproductive hours in an eight-hour workday 
e. Weigh a shipment of goods                             Number of pounds 


5.2 Developing Discrete Probability Distributions 


The probability distribution for a random variable describes how probabilities are 
distributed over the values of the random variable. For a discrete random variable x, a 
probability function, denoted by f(x), provides the probability for each value of the random 
variable. The classical, subjective, and relative frequency methods of assigning probabilities 
can be used to develop discrete probability distributions. Application of this methodology 
leads to what we call tabular discrete probability distributions, that is, probability 
distributions that are presented in a table. 

The classical, subjective, and relative frequency methods were introduced in Chapter 4. 

The classical method of assigning probabilities to values of a random variable is 
applicable when the experimental outcomes generate values of the random variable that 
are equally likely. For instance, consider the experiment of rolling a die and observing the 
number on the upward face. It must be one of the numbers 1, 2, 3, 4, 5, or 6, and each of 
these outcomes is equally likely. Thus, if we let x = number obtained on one roll of a die 
and f(x) = the probability of x, the probability distribution of x is given in Table 5.3. 


TABLE 5.3 Probability Distribution for Number Obtained on One Roll of a Die 

Number Obtained    Probability of x 
x                  f(x) 
1                  1/6 
2                  1/6 
3                  1/6 
4                  1/6 
5                  1/6 
6                  1/6 

The subjective method of assigning probabilities can also lead to a table of values of 
the random variable together with the associated probabilities. With the subjective method 
the individual developing the probability distribution uses her/his best judgment to assign 
each probability. So, unlike probability distributions developed using the classical method, 
different people can be expected to obtain different probability distributions. 

The relative frequency method of assigning probabilities to values of a random variable 
is applicable when reasonably large amounts of data are available. We then treat the data as 
if they were the population and use the relative frequency method to assign probabilities to 
the experimental outcomes. The use of the relative frequency method to develop discrete 
probability distributions leads to what is called an empirical discrete distribution. With 
the large amounts of data available today (e.g., scanner data, credit card data, online social 
media data), this type of probability distribution is becoming more widely used in practice. 
Let us illustrate by considering the sale of automobiles at a dealership. 

We will use the relative frequency method to develop a probability distribution for 
the number of cars sold per day at DiCarlo Motors in Saratoga, New York. Over the past 
300 days, DiCarlo has experienced 54 days with no automobiles sold, 117 days with one 
automobile sold, 72 days with two automobiles sold, 42 days with three automobiles sold, 
12 days with four automobiles sold, and 3 days with five automobiles sold. Suppose we 
consider the experiment of observing a day of operations at DiCarlo Motors and define the 
random variable of interest as x = the number of automobiles sold during a day. Using the 
relative frequencies to assign probabilities to the values of the random variable x, we can 
develop the probability distribution for x. 

In probability function notation, f(0) provides the probability of 0 automobiles sold, f(1) 
provides the probability of 1 automobile sold, and so on. Because historical data show 54 
of 300 days with 0 automobiles sold, we assign the relative frequency 54/300 = .18 to f(0), 
indicating that the probability of 0 automobiles being sold during a day is .18. Similarly, be- 
cause 117 of 300 days had 1 automobile sold, we assign the relative frequency 117/300 = .39 
to f(1), indicating that the probability of exactly 1 automobile being sold during a day is .39. 
Continuing in this way for the other values of the random variable, we compute the values for 
f(2), f(3), f(4), and f(5) as shown in Table 5.4. 
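
The relative frequency calculation can also be sketched in a few lines of Python. This is a minimal illustration and not part of the Excel workflow used in this text; the names days_by_sales and empirical_distribution are hypothetical, and exact fractions are used so that the results agree with the hand calculation.

```python
from fractions import Fraction

# Days (out of 300) on which 0, 1, ..., 5 automobiles were sold at DiCarlo Motors.
days_by_sales = {0: 54, 1: 117, 2: 72, 3: 42, 4: 12, 5: 3}


def empirical_distribution(counts):
    """Relative frequency method: f(x) = (count for x) / (total count)."""
    total = sum(counts.values())
    return {x: Fraction(n, total) for x, n in counts.items()}


f = empirical_distribution(days_by_sales)
for x, p in f.items():
    print(f"f({x}) = {float(p):.2f}")
```

Each printed value is simply the number of days with x automobiles sold divided by 300, which is the calculation described above for f(0) and f(1).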

A primary advantage of defining a random variable and its probability distribution is 
that once the probability distribution is known, it is relatively easy to determine the prob- 
ability of a variety of events that may be of interest to a decision maker. For example, using 
the probability distribution for DiCarlo Motors as shown in Table 5.4, we see that the most 
probable number of automobiles sold during a day is 1 with a probability of f(1) = .39. In 
addition, the probability of selling 3 or more automobiles during a day is f(3) + f(4) + 
f(5) = .14 + .04 + .01 = .19. These probabilities, plus others the decision maker may 
ask about, provide information that can help the decision maker understand the process of 
selling automobiles at DiCarlo Motors. 
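
Once the probability function is stored, event probabilities of this kind reduce to summing f(x) over the values of x that make up the event. The following Python sketch illustrates the idea under that assumption; the helper name prob and the use of a dictionary are illustrative choices, not notation from this text.

```python
# Probability function for daily sales at DiCarlo Motors (values from Table 5.4).
f = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}


def prob(event, pf):
    """P(event) = sum of f(x) over all values x for which event(x) is true."""
    return sum(p for x, p in pf.items() if event(x))


# Value of x with the largest probability.
most_probable = max(f, key=f.get)

# Probability of selling 3 or more automobiles during a day.
p_three_or_more = prob(lambda x: x >= 3, f)

print(most_probable, round(p_three_or_more, 2))
```

Any other event of interest, such as selling at most 1 automobile, can be evaluated by changing the condition passed to prob.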
TABLE 5.4 Probability Distribution for the Number of Automobiles Sold During a Day 
at DiCarlo Motors 

x        f(x) 
0         .18 
1         .39 
2         .24 
3         .14 
4         .04 
5         .01 
Total    1.00 


In the development of a probability function for any discrete random variable, the 
following two conditions must be satisfied. 


REQUIRED CONDITIONS FOR A DISCRETE PROBABILITY FUNCTION 

\[ f(x) \geq 0 \tag{5.1} \]

\[ \sum f(x) = 1 \tag{5.2} \]

These conditions are the analogs to the two basic requirements for assigning probabilities 
to experimental outcomes presented in Chapter 4. 



Table 5.4 shows that the probabilities for the random variable x satisfy equation (5.1); f(x) 
is greater than or equal to 0 for all values of x. In addition, because the probabilities sum 
to 1, equation (5.2) is satisfied. Thus, the DiCarlo Motors probability function is a valid 
discrete probability function. 
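
The two required conditions are easy to check mechanically. The sketch below is one possible way to do so in Python; the function name is_valid_probability_function and the tolerance used for the floating-point sum are assumptions made for this illustration.

```python
import math


def is_valid_probability_function(pf, tol=1e-9):
    """Check the two required conditions: f(x) >= 0 for every x, and sum of f(x) = 1."""
    nonnegative = all(p >= 0 for p in pf.values())
    # Decimal probabilities are stored as floats, so compare the sum to 1 with a tolerance.
    sums_to_one = math.isclose(sum(pf.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one


dicarlo = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}
print(is_valid_probability_function(dicarlo))
```

A table whose probabilities sum to something other than 1, or that contains a negative entry, fails the check.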

We can also show the DiCarlo Motors probability distribution graphically. In Figure 5.1 
the values of the random variable x for DiCarlo Motors are shown on the horizontal axis 
and the probability associated with these values is shown on the vertical axis. 


FIGURE 5.1 Graphical Representation of the Probability Distribution for the Number of 
Automobiles Sold During a Day at DiCarlo Motors 


In addition to the probability distributions shown in tables, a formula that gives the 
probability function, f(x), for every value of x is often used to describe probability 
distributions. The simplest example of a discrete probability distribution given by a formula 
is the discrete uniform probability distribution. Its probability function is defined by 
equation (5.3). 


DISCRETE UNIFORM PROBABILITY FUNCTION 


\[ f(x) = 1/n \tag{5.3} \]

where 

n = the number of values the random variable may assume 


For example, consider again the experiment of rolling a die. We define the random vari- 
able x to be the number of dots on the upward face. For this experiment, n = 6 values are 
possible for the random variable; x = 1, 2, 3, 4, 5, 6. We showed earlier how the probabil- 
ity distribution for this experiment can be expressed as a table. Since the probabilities are 
equally likely, the discrete uniform probability function can also be used. The probability 
function for this discrete uniform random variable is 


\[ f(x) = 1/6, \quad x = 1, 2, 3, 4, 5, 6 \]


Several widely used discrete probability distributions are specified by formulas. Three im- 
portant cases are the binomial, Poisson, and hypergeometric distributions; these distributions 
are discussed later in the chapter. 
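
As a small illustration of a probability function given by a formula, the discrete uniform case can be written as a Python function that assigns 1/n to each of the n possible values. The name discrete_uniform is hypothetical and chosen only for this sketch.

```python
from fractions import Fraction


def discrete_uniform(values):
    """Return f(x) = 1/n for each of the n values the random variable may assume."""
    values = list(values)
    n = len(values)
    return {x: Fraction(1, n) for x in values}


# Rolling a die: x = 1, 2, ..., 6, so n = 6 and f(x) = 1/6.
die = discrete_uniform(range(1, 7))
print(die[3], sum(die.values()))
```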


EXERCISES 
Methods 
7. The probability distribution for the random variable x follows. 


x     f(x) 
20    .20 
25    .15 
30    .25 
35    .40 


a. Is this probability distribution valid? Explain. 
b. What is the probability that x = 30? 
c. What is the probability that x is less than or equal to 25? 
d. What is the probability that x is greater than 30? 



Applications 
8. Operating Room Use. The following data were collected by counting the number 
of operating rooms in use at Tampa General Hospital over a 20-day period: On three 
of the days only one operating room was used, on five of the days two were used, on 
eight of the days three were used, and on four days all four of the hospital’s operating 
rooms were used. 

a. Use the relative frequency approach to construct an empirical discrete probability 
distribution for the number of operating rooms in use on any given day. 

b. Draw a graph of the probability distribution. 

c. Show that your probability distribution satisfies the required conditions for a valid 
discrete probability distribution. 
9. Employee Retention. Employee retention is a major concern for many companies. 
A survey of Americans asked how long they have worked for their current employer 
(Bureau of Labor Statistics website). Consider the following example of sample data 
of 2,000 college graduates who graduated five years ago. 


Time with Current 
Employer (years)     Number 
1                    506 
2                    390 
3                    310 
4                    218 
5                    576 


Let x be the random variable indicating the number of years the respondent has 
worked for her/his current employer. 

a. Use the data to develop an empirical discrete probability distribution for x. 

b. Show that your probability distribution satisfies the conditions for a valid discrete 
probability distribution. 

c. What is the probability that a respondent has been at her/his current place of 
employment for more than 3 years? 

10. Job Satisfaction of Information-Systems Managers. The percent frequency dis- 
tributions of job satisfaction scores for a sample of information-systems (IS) senior 
executives and middle managers are as follows. The scores range from a low of 1 (very 
dissatisfied) to a high of 5 (very satisfied). 


Job Satisfaction    IS Senior         IS Middle 
Score               Executives (%)    Managers (%) 
1                    5                 4 
2                    9                10 
3                    3                12 
4                   42                46 
5                   41                28 


a. Develop a probability distribution for the job satisfaction score of a senior executive. 

b. Develop a probability distribution for the job satisfaction score of a middle manager. 

c. What is the probability a senior executive will report a job satisfaction score of 4 or 5? 

d. What is the probability a middle manager is very satisfied? 

e. Compare the overall job satisfaction of senior executives and middle managers. 

11. Mailing Machine Malfunctions. A technician services mailing machines at compa- 
nies in the Phoenix area. Depending on the type of malfunction, the service call can 
take 1, 2, 3, or 4 hours. The different types of malfunctions occur at about the same 
frequency. 

a. Develop a probability distribution for the duration of a service call. 

b. Draw a graph of the probability distribution. 

c. Show that your probability distribution satisfies the conditions required for a 
discrete probability function. 

d. What is the probability a service call will take 3 hours? 

e. A service call has just come in, but the type of malfunction is unknown. It is 3:00 p.m. 
and service technicians usually get off at 5:00 p.m. What is the probability the service 
technician will have to work overtime to fix the machine today? 

12. New Cable Subscribers. Spectrum provides cable television and Internet service 
to millions of customers. Suppose that the management of Spectrum subjectively 
assesses a probability distribution for the number of new subscribers next year in the 
state of New York as follows. 


x          f(x) 
100,000    .10 
200,000    .20 
300,000    .25 
400,000    .30 
500,000    .10 
600,000    .05 


a. Is this probability distribution valid? Explain. 

b. What is the probability Spectrum will obtain more than 400,000 new subscribers? 

c. What is the probability Spectrum will obtain fewer than 200,000 new subscribers? 
13. Establishing Patient Trust. A psychologist determined that the number of sessions 
required to obtain the trust of a new patient is either 1, 2, or 3. Let x be a random 
variable indicating the number of sessions required to gain the patient’s trust. The 
following probability function has been proposed. 


\[ f(x) = \frac{x}{6} \quad \text{for } x = 1, 2, \text{ or } 3 \]


a. Is this probability function valid? Explain. 

b. What is the probability that it takes exactly 2 sessions to gain the patient’s trust? 

c. What is the probability that it takes at least 2 sessions to gain the patient’s trust? 
14. MRA Company Projected Profits. The following table is a partial probability 
distribution for the MRA Company’s projected profits (x = profit in $1,000s) for the first 
year of operation (the negative value denotes a loss). 


x       f(x) 
−100    .10 
0       .20 
50      .30 
100     .25 
150     .10 
200 


a. What is the proper value for f(200)? What is your interpretation of this value? 
b. What is the probability that MRA will be profitable? 
c. What is the probability that MRA will make at least $100,000? 


5.3 Expected Value and Variance 
Expected Value 


The expected value, or mean, of a random variable is a measure of the central location for the 
random variable. The formula for the expected value of a discrete random variable x follows. 


EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE 

\[ E(x) = \mu = \sum x f(x) \tag{5.4} \]

The expected value is a weighted average of the values of the random variable where the 
weights are the probabilities. 


Both the notations E(x) and μ are used to denote the expected value of a random variable. 
The expected value does not have to be a value the random variable can assume. 

The variance is a weighted average of the squared deviations of a random variable from its 
mean. The weights are the probabilities. 


TABLE 5.5 Calculation of the Expected Value for the Number of Automobiles Sold During 
a Day at DiCarlo Motors 

x     f(x)     xf(x) 
0     .18      0(.18) =  .00 
1     .39      1(.39) =  .39 
2     .24      2(.24) =  .48 
3     .14      3(.14) =  .42 
4     .04      4(.04) =  .16 
5     .01      5(.01) =  .05 
                        1.50 

E(x) = μ = Σxf(x) = 1.50 


Equation (5.4) shows that to compute the expected value of a discrete random variable, we 
must multiply each value of the random variable by the corresponding probability f(x) and 
then add the resulting products. Using the DiCarlo Motors automobile sales example from 
Section 5.2, we show the calculation of the expected value for the number of automobiles sold 
during a day in Table 5.5. The sum of the entries in the xf(x) column shows that the expected 
value is 1.50 automobiles per day. We therefore know that although sales of 0, 1, 2, 3, 4, or 
5 automobiles are possible on any one day, over time DiCarlo can anticipate selling an aver- 
age of 1.50 automobiles per day. Assuming 30 days of operation during a month, we can use 
the expected value of 1.50 to forecast average monthly sales of 30(1.50) = 45 automobiles. 
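
The arithmetic in Table 5.5 can be reproduced with a short Python sketch. It is offered only as an illustration of equation (5.4); the function name expected_value is an assumption of this example, and the 30-day month is the same assumption made above.

```python
# Probability function for daily sales at DiCarlo Motors (Table 5.4).
f = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}


def expected_value(pf):
    """E(x) = sum of x * f(x) over all values of x, as in equation (5.4)."""
    return sum(x * p for x, p in pf.items())


mu = expected_value(f)

# Forecast of average monthly sales, assuming 30 days of operation.
monthly_forecast = 30 * mu

print(round(mu, 2), round(monthly_forecast, 1))
```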


Variance 


The expected value provides a measure of central tendency for a random variable, but 
we often also want a measure of variability, or dispersion. Just as we used the variance in 
Chapter 3 to summarize the variability in data, we now use variance to summarize the 
variability in the values of a random variable. The formula for the variance of a discrete 
random variable follows. 


VARIANCE OF A DISCRETE RANDOM VARIABLE 


\[ \mathrm{Var}(x) = \sigma^2 = \sum (x - \mu)^2 f(x) \tag{5.5} \]


As equation (5.5) shows, an essential part of the variance formula is the deviation, x − μ, 
which measures how far a particular value of the random variable is from the expected 
value, or mean, μ. In computing the variance of a random variable, the deviations are 
squared and then weighted by the corresponding value of the probability function. The 
sum of these weighted squared deviations for all values of the random variable is referred 
to as the variance. The notations Var(x) and σ² are both used to denote the variance of a 
random variable. 

The calculation of the variance for the probability distribution of the number of auto- 
mobiles sold during a day at DiCarlo Motors is summarized in Table 5.6. We see that the 
variance is 1.25. The standard deviation, σ, is defined as the positive square root of the 
variance. Thus, the standard deviation for the number of automobiles sold during a day is 


\[ \sigma = \sqrt{1.25} = 1.118 \]


The standard deviation is measured in the same units as the random variable (σ = 1.118 
automobiles) and therefore is often preferred in describing the variability of a random 
variable. The variance σ² is measured in squared units and is thus more difficult to interpret. 
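
The variance and standard deviation for the DiCarlo Motors example can be checked with the following Python sketch of equation (5.5). The function name variance is hypothetical and used only for this illustration.

```python
import math

# Probability function for daily sales at DiCarlo Motors (Table 5.4).
f = {0: 0.18, 1: 0.39, 2: 0.24, 3: 0.14, 4: 0.04, 5: 0.01}


def variance(pf):
    """Var(x) = sum of (x - mu)^2 * f(x), where mu = sum of x * f(x)."""
    mu = sum(x * p for x, p in pf.items())
    return sum((x - mu) ** 2 * p for x, p in pf.items())


var = variance(f)
std_dev = math.sqrt(var)  # positive square root of the variance

print(round(var, 4), round(std_dev, 3))
```

Because the deviations are squared before they are weighted, values far from the mean, such as x = 5, contribute to the variance even when their probability is small.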
TABLE 5.6 Calculation of the Variance for the Number of Automobiles Sold During a Day 
at DiCarlo Motors 

x    x − μ                (x − μ)²    f(x)    (x − μ)²f(x) 
0    0 − 1.50 = −1.50      2.25       .18      2.25(.18) =  .4050 
1    1 − 1.50 =  −.50       .25       .39       .25(.39) =  .0975 
2    2 − 1.50 =   .50       .25       .24       .25(.24) =  .0600 
3    3 − 1.50 =  1.50      2.25       .14      2.25(.14) =  .3150 
4    4 − 1.50 =  2.50      6.25       .04      6.25(.04) =  .2500 
5    5 − 1.50 =  3.50     12.25       .01     12.25(.01) =  .1225 
                                                           1.2500 


Using Excel to Compute the Expected Value, Variance, 
and Standard Deviation 


The calculations involved in computing the expected value and variance for a discrete 
random variable can easily be made in an Excel worksheet. One approach is to enter the 
formulas necessary to make the calculations in Tables 5.5 and 5.6. An easier way, however, 
is to make use of Excel’s SUMPRODUCT function. In this subsection we show 
how to use the SUMPRODUCT function to compute the expected value and variance 
for daily automobile sales at DiCarlo Motors. Refer to Figure 5.2 as we describe the 
tasks involved. The formula worksheet is in the background; the value worksheet is in 
the foreground. 


FIGURE 5.2 Excel Worksheet for Expected Value, Variance, and Standard Deviation 

Formula worksheet 

      A                B                            C 
 1    Sales            Probability                  Sq Dev from Mean 
 2    0                0.18                         =(A2-$B$9)^2 
 3    1                0.39                         =(A3-$B$9)^2 
 4    2                0.24                         =(A4-$B$9)^2 
 5    3                0.14                         =(A5-$B$9)^2 
 6    4                0.04                         =(A6-$B$9)^2 
 7    5                0.01                         =(A7-$B$9)^2 
 8 
 9    Mean             =SUMPRODUCT(A2:A7,B2:B7) 
10 
11    Variance         =SUMPRODUCT(C2:C7,B2:B7) 
12 
13    Std Deviation    =SQRT(B11) 

Value worksheet 

      A                B              C 
 1    Sales            Probability    Sq Dev from Mean 
 2    0                0.18            2.25 
 3    1                0.39            0.25 
 4    2                0.24            0.25 
 5    3                0.14            2.25 
 6    4                0.04            6.25 
 7    5                0.01           12.25 
 8 
 9    Mean             1.5 
10 
11    Variance         1.25 
12 
13    Std Deviation    1.118 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


Enter/Access Data: The data needed are the values for the random variable and the 
corresponding probabilities. Labels, values for the random variable, and the corresponding 
probabilities are entered in cells A1:B7. 


Enter Functions and Formulas: The SUMPRODUCT function multiplies each value 
in one range by the corresponding value in another range and sums the products. To use 
the SUMPRODUCT function to compute the expected value of daily automobile sales at 
DiCarlo Motors, we entered the following formula in cell B9: 


=SUMPRODUCT(A2:A7,B2:B7) 


Note that the first range, A2:A7, contains the values for the random variable, daily auto- 
mobile sales. The second range, B2:B7, contains the corresponding probabilities. Thus, the 
SUMPRODUCT function in cell B9 is computing A2*B2 + A3*B3 + A4*B4 + A5*B5 + 
A6*B6 + A7*B7; hence, it is applying the formula in equation (5.4) to compute the expec- 
ted value. The result, shown in cell B9 of the value worksheet, is 1.5. 

The formulas in cells C2:C7 are used to compute the squared deviations from the 
expected value or mean of 1.5 (the mean is in cell B9). The results, shown in the value 
worksheet, are the same as the results shown in Table 5.6. The formula necessary to compute the 
variance for daily automobile sales was entered in cell B11. It uses the SUMPRODUCT 
function to multiply each value in the range C2:C7 by each corresponding value in the 
range B2:B7 and sums the products. The result, shown in the value worksheet, is 1.25. 
Because the standard deviation is the square root of the variance, we entered the formula 
=SQRT(B11) in cell B13 to compute the standard deviation for daily automobile sales. The 
result, shown in the value worksheet, is 1.118. 
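
For readers who want to see what the worksheet is doing outside of Excel, the following Python sketch mirrors the layout of Figure 5.2. It is an illustration only: the helper sumproduct is a hypothetical stand-in written for this example, not an Excel or Python library function, and the comments map each variable to the worksheet cells it imitates.

```python
import math


def sumproduct(xs, ys):
    """Multiply corresponding entries of two equal-length ranges and sum the products."""
    if len(xs) != len(ys):
        raise ValueError("ranges must have the same number of entries")
    return sum(x * y for x, y in zip(xs, ys))


sales = [0, 1, 2, 3, 4, 5]                          # cells A2:A7
probability = [0.18, 0.39, 0.24, 0.14, 0.04, 0.01]  # cells B2:B7

mean = sumproduct(sales, probability)               # cell B9
sq_dev = [(x - mean) ** 2 for x in sales]           # cells C2:C7
var = sumproduct(sq_dev, probability)               # cell B11
std_dev = math.sqrt(var)                            # cell B13

print(round(mean, 2), round(var, 2), round(std_dev, 3))
```

The mean must be computed before the squared deviations, just as the formulas in cells C2:C7 depend on cell B9.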


Methods 
15. The following table provides a probability distribution for the random variable x. 


x     f(x) 
3     .25 
6     .50 
9     .25 


a. Compute E(x), the expected value of x. 
b. Compute σ², the variance of x. 
c. Compute σ, the standard deviation of x. 
16. The following table provides a probability distribution for the random variable y. 


y     f(y) 
[table values are not legible in the source] 


a. Compute E(y). 
b. Compute Var(y) and σ. 
Applications 
17. Golf Scores. During the summer of 2019, Coldstream Country Club in Cincinnati, 
Ohio, collected data on 443 rounds of golf played from its white tees. The data for 
each golfer’s score on the twelfth hole are contained in the file Coldstream12. 
a. Construct an empirical discrete probability distribution for the player scores on the 
12th hole. 
b. A par is the score that a good golfer is expected to get for the hole. For hole number 
12, par is four. What is the probability of a player scoring less than or equal to 
par on hole number 12? 
c. What is the expected score for hole number 12? 
d. What is the variance for hole number 12? 
e. What is the standard deviation for hole number 12? 
18. Water Supply Stoppages. The following data have been collected on the number 
of times that owner-occupied and renter-occupied units had a water supply stoppage 
lasting 6 or more hours over a 3-month period. 




Number of Units (1,000s) 
Number of Times Owner Occupied Renter Occupied 
0 439 394 
1 1,100 760 
2 249 221 
3 98 22 
4 times or more 120 1a 


a. Define a random variable x = number of times that owner-occupied units had a wa- 
ter supply stoppage lasting 6 or more hours in the past 3 months and develop a prob- 
ability distribution for the random variable. (Let x = 4 represent 4 or more times.) 

b. Compute the expected value and variance for x. 

c. Define a random variable y = number of times that renter-occupied units had a water 
supply stoppage lasting 6 or more hours in the past 3 months and develop a probabil- 
ity distribution for the random variable. (Let y = 4 represent 4 or more times.) 

d. Compute the expected value and variance for y. 

e. What observations can you make from a comparison of the number of water supply 
stoppages reported by owner-occupied units versus renter-occupied units? 

19. New Tax Accounting Clients. New legislation passed in 2017 by the U.S. Congress 
changed tax laws that affect how many people file their taxes in 2018 and beyond. 
These tax law changes will likely lead many people to seek tax advice from their ac- 
countants (The New York Times). Backen and Hayes LLC is an accounting firm in New 
York state. The accounting firm believes that it may have to hire additional accountants 
to assist with the increased demand in tax advice for the upcoming tax season. Backen 
and Hayes LLC has developed the following probability distribution for x = number 
of new clients seeking tax advice. 


x     f(x) 
20    .05 
25    .20 
30    .25 
35    .15 
40    .15 
45    .10 
50    .10 

a. Is this a valid probability distribution? Explain. 
b. What is the probability that Backen and Hayes LLC will obtain 40 or more new clients? 
c. What is the probability that Backen and Hayes LLC will obtain fewer than 
35 new clients? 
d. Compute the expected value, variance, and standard deviation of x. 
20. Automobile Insurance Damage Claims. The probability distribution for damage claims 
paid by the Newton Automobile Insurance Company on collision insurance follows. 


Payment ($)    Probability 
0              .85 
500            .04 
1,000          .04 
3,000          .03 
5,000          .02 
8,000          .01 
10,000         .01 


a. Use the expected collision payment to determine the collision insurance premium 
that would enable the company to break even. 

b. The insurance company charges an annual rate of $520 for the collision coverage. 
What is the expected value of the collision policy for a policyholder? (Hint: It is the 
expected payments from the company minus the cost of coverage.) Why does the 
policyholder purchase a collision policy with this expected value? 

21. Information-Systems Managers Job Satisfaction. The following probability dis- 
tributions of job satisfaction scores for a sample of information-systems (IS) senior 
executives and middle managers range from a low of 1 (very dissatisfied) to a high of 
5 (very satisfied). 


                    Probability 
Job Satisfaction    IS Senior     IS Middle 
Score               Executives    Managers 
1                   .05           .04 
2                   .09           .10 
3                   .03           .12 
4                   .42           .46 
5                   .41           .28 


a. What is the expected value of the job satisfaction score for senior executives? 

b. What is the expected value of the job satisfaction score for middle managers? 

c. Compute the variance of job satisfaction scores for executives and middle 
managers. 

d. Compute the standard deviation of job satisfaction scores for both probability 
distributions. 

e. Compare the overall job satisfaction of senior executives and middle managers. 

22. Carolina Industries Product Demand. The demand for a product of Carolina Indus- 
tries varies greatly from month to month. The probability distribution in the following 
table, based on the past two years of data, shows the company’s monthly demand. 


Unit Demand    Probability 
300            .20 
400            .30 
500            .35 
600            .15 

a. If the company bases monthly orders on the expected value of the monthly demand, 
what should Carolina’s monthly order quantity be for this product? 

b. Assume that each unit demanded generates $70 in revenue and that each unit 
ordered costs $50. How much will the company gain or lose in a month if it places 
an order based on your answer to part (a) and the actual demand for the item is 
300 units? 

23. Coffee Consumption. In Gallup’s Annual Consumption Habits Poll, telephone in- 
terviews were conducted for a random sample of 1,014 adults aged 18 and over. One 
of the questions was, “How many cups of coffee, if any, do you drink on an average 
day?” The following table shows the results obtained (Gallup website). 


Number of Cups    Number of 
per Day           Responses 
0                 365 
1                 264 
2                 193 
3                  91 
4 or more         101 


Define a random variable x = number of cups of coffee consumed on an average day. 
Let x = 4 represent four or more cups. 

a. Develop a probability distribution for x. 

b. Compute the expected value of x. 

c. Compute the variance of x. 

d. Suppose we are only interested in adults who drink at least one cup of coffee on 
an average day. For this group, let y = the number of cups of coffee consumed on 
an average day. Compute the expected value of y and compare it to the expected 
value of x. 

24. Computer Company Plant Expansion. The J. R. Ryland Computer Company is con- 
sidering a plant expansion to enable the company to begin production of a new computer 
product. The company’s president must determine whether to make the expansion a 
medium- or large-scale project. Demand for the new product is uncertain, which for 
planning purposes may be low demand, medium demand, or high demand. The probab- 
ility estimates for demand are .20, .50, and .30, respectively. Letting x and y indicate the 
annual profit in thousands of dollars, the firm’s planners developed the following profit 
forecasts for the medium- and large-scale expansion projects. 


              Medium-Scale           Large-Scale 
              Expansion Profit       Expansion Profit 
Demand        x        f(x)          y        f(y) 
Low            50      .20             0      .20 
Medium        150      .50           100      .50 
High          200      .30           300      .30 


a. Compute the expected value for the profit associated with the two expansion 
alternatives. Which decision is preferred for the objective of maximizing the ex- 
pected profit? 

b. Compute the variance for the profit associated with the two expansion alternatives. 
Which decision is preferred for the objective of minimizing the risk or uncertainty? 
5.4 Bivariate Distributions, Covariance, and Financial Portfolios 


A probability distribution involving two random variables is called a bivariate 
probability distribution. In discussing bivariate probability distributions, it is useful 
to think of a bivariate experiment. Each outcome for a bivariate experiment consists 
of two values, one for each random variable. For example, consider the bivariate 
experiment of rolling a pair of dice. The outcome consists of two values, the number 
obtained with the first die and the number obtained with the second die. As another 
example, consider the experiment of observing the financial markets for a year and 
recording the percentage gain for a stock fund and a bond fund. Again, the experimen- 
tal outcome provides a value for two random variables, the percent gain in the stock 
fund and the percent gain in the bond fund. When dealing with bivariate probability 
distributions, we are often interested in the relationship between the random variables. 
In this section, we introduce bivariate distributions and show how the covariance and 
correlation coefficient can be used as a measure of linear association between the ran- 
dom variables. We shall also see how bivariate probability distributions can be used to 
construct and analyze financial portfolios. 


A Bivariate Empirical Discrete Probability Distribution 


Recall that in Section 5.2 we developed an empirical discrete distribution for daily sales 
at the DiCarlo Motors automobile dealership in Saratoga, New York. DiCarlo has another 
dealership in Geneva, New York. Table 5.7 shows the number of cars sold at each of the 
dealerships over a 300-day period. The numbers in the bottom row (labeled “Total”) are the
frequencies we used to develop an empirical probability distribution for daily sales at 
DiCarlo’s Saratoga dealership in Section 5.2. The numbers in the right-most column 
(labeled “Total”) are the frequencies of daily sales for the Geneva dealership. Entries in 
the body of the table give the number of days the Geneva dealership had a level of sales 
indicated by the row, when the Saratoga dealership had the level of sales indicated by the 
column. For example, the entry of 33 in the Geneva dealership row labeled “1” and the 
Saratoga column labeled “2” indicates that for 33 days out of the 300, the Geneva dealership
sold 1 car and the Saratoga dealership sold 2 cars.

Suppose we consider the bivariate experiment of observing a day of operations at 
DiCarlo Motors and recording the number of cars sold. Let us define x = number of cars 
sold at the Geneva dealership and y = the number of cars sold at the Saratoga dealership. 
We can now divide all of the frequencies in Table 5.7 by the number of observations (300) 
to develop a bivariate empirical discrete probability distribution for automobile sales at the 
two DiCarlo dealerships. Table 5.8 shows this bivariate discrete probability distribution. 
The probabilities in the lower margin provide the marginal distribution for the DiCarlo
Motors Saratoga dealership. The probabilities in the right margin provide the marginal
distribution for the DiCarlo Motors Geneva dealership.


TABLE 5.7 Number of Automobiles Sold at DiCarlo’s Saratoga and Geneva
Dealerships over 300 Days

Saratoga Dealership
Geneva Dealership 0 1 2 3 4 5 Total


TABLE 5.8 Bivariate Empirical Discrete Probability Distribution for Daily
Sales at DiCarlo Dealerships in Saratoga and Geneva

Saratoga Dealership
Geneva Dealership 0 1 2 3 4 5 Total

The probabilities in the body of the table provide the bivariate probability distribution
for sales at both dealerships. Bivariate probabilities are often called joint probabilities.
We see that the joint probability of selling 0 automobiles at Geneva and 1 automobile
at Saratoga on a typical day is f(0, 1) = .1000, the joint probability of selling
1 automobile at Geneva and 4 automobiles at Saratoga on a typical day is .0067, and 
so on. Note that there is one bivariate probability for each experimental outcome. With 
4 possible values for x and 6 possible values for y, there are 24 experimental outcomes 
and bivariate probabilities. 

Suppose we would like to know the probability distribution for total sales at both DiCarlo 
dealerships and the expected value and variance of total sales. We can define s = x + y 
as total sales for DiCarlo Motors. Working with the bivariate probabilities in Table 5.8, we 
see that f(s = 0) = .0700, f(s = 1) = .0700 + .1000 = .1700, f(s = 2) = .0300 + .1200 + 
.0800 = .2300, and so on. We show the complete probability distribution for s = x + y
along with the computation of the expected value and variance in Table 5.9. The expected
value is E(s) = 2.6433 and the variance is Var(s) = 2.3895.

Computing covariance and correlation coefficients for sample data is discussed
in Chapter 3.

With bivariate probability distributions, we often want to know the relationship between
the two random variables. The covariance and/or correlation coefficient are good measures


TABLE 5.9 Calculation of the Expected Value and Variance for Total Daily
Sales at DiCarlo Motors


s    f(s)     sf(s)    s − E(s)    (s − E(s))²    (s − E(s))² f(s)
0    .0700    .0000    −2.6433      6.9872        .4891
1    .1700    .1700    −1.6433      2.7005        .4591
2    .2300    .4600    −0.6433      0.4139        .0952
3    .2900    .8700     0.3567      0.1272        .0369
4    .1267    .5067     1.3567      1.8405        .2331
5    .0667    .3333     2.3567      5.5539        .3703
6    .0233    .1400     3.3567     11.2672        .2629
7    .0233    .1633     4.3567     18.9805        .4429
8    .0000    .0000     5.3567     28.6939        .0000
     E(s) = 2.6433                               Var(s) = 2.3895


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




TABLE 5.10 Calculation of the Expected Value and Variance of Daily
Automobile Sales at DiCarlo Motors’ Geneva Dealership


x    f(x)     xf(x)    x − E(x)    (x − E(x))²    (x − E(x))² f(x)
0    .2867    .0000    −1.1435     1.3076         .3749
1    .3700    .3700    −0.1435     0.0206         .0076
2    .2567    .5134     0.8565     0.7336         .1883
3    .0867    .2601     1.8565     3.4466         .2988
     E(x) = 1.1435                               Var(x) = .8696


of association between two random variables. The formula we will use for computing the 
covariance between two random variables x and y is given below. 


COVARIANCE OF RANDOM VARIABLES x AND y (see footnote 1) 


σ_xy = [Var(x + y) − Var(x) − Var(y)]/2 (5.6)


We have already computed Var(s) = Var(x + y) and, in Section 5.2, we computed 
Var(y). Now we need to compute Var(x) before we can use equation (5.6) to compute the 
covariance of x and y. Using the probability distribution for x (the right margin of Table 
5.8), we compute E(x) and Var(x) in Table 5.10. 

We can now use equation (5.6) to compute the covariance of the random variables x and y. 


σ_xy = [Var(x + y) − Var(x) − Var(y)]/2 = (2.3895 − .8696 − 1.25)/2 = .1350


A covariance of .1350 indicates that daily sales at DiCarlo’s two dealerships have a positive 
relationship. To get a better sense of the strength of the relationship we can compute the 
correlation coefficient. The correlation coefficient for the two random variables x and y is 
given by equation (5.7). 


CORRELATION BETWEEN RANDOM VARIABLES x AND y 


ρ_xy = σ_xy / (σ_x σ_y) (5.7)


From equation (5.7), we see that the correlation coefficient for two random variables is the 
covariance divided by the product of the standard deviations for the two random variables. 

Let us compute the correlation coefficient between daily sales at the two DiCarlo 
dealerships. First we compute the standard deviations for sales at the Saratoga and Geneva 
dealerships by taking the square root of the variance. 


σ_x = √.8696 = .9325
σ_y = √1.25 = 1.1180


Now we can compute the correlation coefficient as a measure of the linear association 
between the two random variables. 

ρ_xy = σ_xy / (σ_x σ_y) = .1350 / [(.9325)(1.1180)] = .1295
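
As a cross-check on the DiCarlo calculations, the following Python sketch recomputes the covariance and correlation coefficient using equations (5.6) and (5.7). It is an illustration only. The day counts are an assumption: they were recovered by multiplying the f(s) column of Table 5.9 and the f(x) column of Table 5.10 by 300 and rounding. Var(y) = 1.25 is the Section 5.2 value quoted in the text, and the helper name `mean_var` is hypothetical.

```python
from math import sqrt

def mean_var(values, counts):
    """Expected value and variance of an empirical discrete distribution."""
    n = sum(counts)
    probs = [c / n for c in counts]
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    return mean, var

# Assumed day counts (table probabilities times 300, rounded).
s_counts = [21, 51, 69, 87, 38, 20, 7, 7, 0]   # total sales s = 0..8
x_counts = [86, 111, 77, 26]                    # Geneva sales x = 0..3

e_s, var_s = mean_var(range(9), s_counts)
e_x, var_x = mean_var(range(4), x_counts)
var_y = 1.25                                    # Saratoga variance, from Section 5.2

cov_xy = (var_s - var_x - var_y) / 2            # equation (5.6)
rho_xy = cov_xy / (sqrt(var_x) * sqrt(var_y))   # equation (5.7)

print(f"E(s) = {e_s:.4f}, Var(s) = {var_s:.4f}")
print(f"E(x) = {e_x:.4f}, Var(x) = {var_x:.4f}")
print(f"covariance = {cov_xy:.4f}, correlation = {rho_xy:.4f}")
```

Because the sketch works from whole-day counts rather than probabilities rounded to four decimals, intermediate values such as E(x) can differ from the tables in the fourth decimal place; the covariance and correlation agree with the text after rounding.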


¹Another formula is often used to compute the covariance of x and y when Var(x + y) is not known.
It is σ_xy = Σ_i Σ_j [x_i − E(x)][y_j − E(y)] f(x_i, y_j).

The correlation coefficient is a measure of the linear association between two variables.
Values near +1 indicate a strong positive linear relationship; values near −1 indicate a
strong negative linear relationship; and values near zero indicate a lack of a linear relation- 
ship. This interpretation is also valid for random variables. The correlation coefficient of 
.1295 indicates there is a weak positive relationship between the random variables
representing daily sales at the two DiCarlo dealerships. If the correlation coefficient had
equaled zero, we would have concluded that daily sales at the two dealerships have no linear
relationship; a zero correlation by itself does not establish that they are independent.


Financial Applications 


Let us now see how what we have learned can be useful in constructing financial 
portfolios that provide a good balance of risk and return. A financial advisor is consider- 
ing four possible economic scenarios for the coming year and has developed a probability 
distribution showing the percent return, x, for investing in a large-cap stock fund and the 
percent return, y, for investing in a long-term government bond fund given each of the scen- 
arios. The bivariate probability distribution for x and y is shown in Table 5.11. Table 5.11 is 
simply a list with a separate row for each experimental outcome (economic scenario). Each 
row contains the joint probability for the experimental outcome and a value for each random 
variable. Since there are only 4 joint probabilities, the tabular form used in Table 5.11 
is simpler than the one we used for DiCarlo Motors where there were (4)(6) = 24 joint 
probabilities. 

Using the formula in Section 5.3 for computing the expected value of a single random 
variable, we can compute the expected percent return for investing in the stock fund, E(x), 
and the expected percent return for investing in the bond fund, E(y). 


E(x) = .10(−40) + .25(5) + .5(15) + .15(30) = 9.25
E(y) = .10(30) + .25(5) + .5(4) + .15(2) = 6.55 


Using this information, we might conclude that investing in the stock fund is a better 
investment. It has a higher expected return, 9.25%. But financial analysts recommend that 
investors also consider the risk associated with an investment. The standard deviation of 
percent return is often used as a measure of risk. To compute the standard deviation, we 
must first compute the variance. Using the formula in Section 5.3 for computing the vari- 
ance of a single random variable, we can compute the variance of the percent returns for 
the stock and bond fund investments. 


Var(x) = .10(−40 − 9.25)² + .25(5 − 9.25)² + .50(15 − 9.25)² + .15(30 − 9.25)² = 328.1875
Var(y) = .10(30 − 6.55)² + .25(5 − 6.55)² + .50(4 − 6.55)² + .15(2 − 6.55)² = 61.9475
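
The expected values and variances for the two funds can be verified with a few lines of Python. This is a minimal sketch that takes the scenario probabilities and returns from Table 5.11; the function names `expected_value` and `variance` are hypothetical helpers, not part of any library.

```python
from math import sqrt

# Table 5.11: (probability, stock return x, bond return y), returns in percent.
scenarios = [
    (0.10, -40, 30),   # recession
    (0.25,   5,  5),   # weak growth
    (0.50,  15,  4),   # stable growth
    (0.15,  30,  2),   # strong growth
]

def expected_value(pairs):
    """E(v) = sum of v * f(v) over (probability, value) pairs."""
    return sum(p * v for p, v in pairs)

def variance(pairs):
    """Var(v) = sum of (v - E(v))^2 * f(v) over (probability, value) pairs."""
    mu = expected_value(pairs)
    return sum(p * (v - mu) ** 2 for p, v in pairs)

stock = [(p, x) for p, x, _ in scenarios]
bond = [(p, y) for p, _, y in scenarios]

for name, fund in (("stock", stock), ("bond", bond)):
    mu, var = expected_value(fund), variance(fund)
    print(f"{name}: E = {mu:.2f}%, Var = {var:.4f}, SD = {sqrt(var):.4f}%")
```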


The standard deviation of the return from an investment in the stock fund is
σ_x = √328.1875 = 18.1159% and the standard deviation of the return from an investment
in the bond fund is σ_y = √61.9475 = 7.8707%. So, we can conclude that investing in the


TABLE 5.11 Probability Distribution of Percent Returns for Investing in
a Large-Cap Stock Fund, x, and Investing in a Long-Term
Government Bond Fund, y


                     Probability    Large-Cap         Long-Term Government
Economic Scenario    f(x, y)        Stock Fund (x)    Bond Fund (y)
Recession            .10            −40               30
Weak growth          .25              5                5
Stable growth        .50             15                4
Strong growth        .15             30                2




bond fund is less risky. It has the smaller standard deviation. We have already seen that the 
stock fund offers a greater expected return, so if we want to choose between investing in 
either the stock fund or the bond fund it depends on our attitude toward risk and return. An 
aggressive investor might choose the stock fund because of the higher expected return; a 
conservative investor might choose the bond fund because of the lower risk. But there are 
other options. What about the possibility of investing in a portfolio consisting of both an 
investment in the stock fund and an investment in the bond fund? 

Suppose we would like to consider three alternatives: investing solely in the large- 
cap stock fund, investing solely in the long-term government bond fund, and splitting 
our funds equally between the stock fund and the bond fund (one-half in each). We have 
already computed the expected value and standard deviation for investing solely in the 
stock fund and the bond fund. Let us now evaluate the third alternative: constructing a 
portfolio by investing equal amounts in the large-cap stock fund and in the long-term 
government bond fund. 

To evaluate this portfolio, we start by computing its expected return. We have previously 
defined x as the percent return from an investment in the stock fund and y as the percent 
return from an investment in the bond fund so the percent return for our portfolio is r = .5x 
+ .5y. To find the expected return for a portfolio with one-half invested in the stock fund and 
one-half invested in the bond fund, we want to compute E(r) = E(.5x + .5y). The expres- 
sion .5x + .5y is called a linear combination of the random variables x and y. Equation (5.8) 
provides an easy method for computing the expected value of a linear combination of the 
random variables x and y when we already know E(x) and E(y). In equation (5.8), a repres- 
ents the coefficient of x and b represents the coefficient of y in the linear combination. 


EXPECTED VALUE OF A LINEAR COMBINATION OF RANDOM VARIABLES x AND y 


E(ax + by) = aE(x) + bE(y) (5.8) 


Since we have already computed E(x) = 9.25 and E(y) = 6.55, we can use equation 
(5.8) to compute the expected value of our portfolio. 


E(.5x + .5y) = .5E(x) + .5E(y) = .5(9.25) + .5(6.55) = 7.9


We see that the expected return for investing in the portfolio is 7.9%. With $100 invested, 
we would expect a return of $100(.079) = $7.90; with $1,000 invested we would expect 
a return of $1,000(.079) = $79.00; and so on. But what about the risk? As mentioned 
previously, financial analysts often use the standard deviation as a measure of risk. 

Our portfolio is a linear combination of two random variables, so we need to be able to 
compute the variance and standard deviation of a linear combination of two random vari- 
ables in order to assess the portfolio risk. When the covariance between two random vari- 
ables is known, the formula given by equation (5.9) can be used to compute the variance of 
a linear combination of two random variables. 


VARIANCE OF A LINEAR COMBINATION OF TWO RANDOM VARIABLES 


Var(ax + by) = a²Var(x) + b²Var(y) + 2abσ_xy (5.9)


where σ_xy is the covariance of x and y


From equation (5.9), we see that both the variance of each random variable individually 
and the covariance between the random variables are needed to compute the variance of a 
linear combination of two random variables and hence the variance of our portfolio. 

TABLE 5.12 Expected Values, Variances, and Standard Deviations
for Three Investment Alternatives


                                 Expected      Variance     Standard Deviation
Investment Alternative           Return (%)    of Return    of Return (%)
100% in Stock Fund               9.25          328.1875     18.1159
100% in Bond Fund                6.55          61.9475      7.8707
Portfolio (50% in Stock Fund,    7.90          29.865       5.4650
50% in Bond Fund)


We have already computed the variance of each random variable individually: Var(x) =
328.1875 and Var(y) = 61.9475. Also, it can be shown that Var(x + y) = 119.46. So, using
equation (5.6), the covariance of the random variables x and y is


σ_xy = [Var(x + y) − Var(x) − Var(y)]/2 = [119.46 − 328.1875 − 61.9475]/2 = −135.3375


We computed Var(x + y) = 119.46 the same way we did for DiCarlo Motors in the
previous subsection.
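
The value Var(x + y) = 119.46 can be reproduced directly from the four scenarios in Table 5.11. The sketch below does that and then computes the covariance two ways: from equation (5.6), and from the deviation-product formula given in the footnote to equation (5.6). The helper names `mean` and `var` are hypothetical.

```python
# Table 5.11 scenarios: (probability, stock return x, bond return y).
scenarios = [(0.10, -40, 30), (0.25, 5, 5), (0.50, 15, 4), (0.15, 30, 2)]

def mean(prob_values):
    """Expected value over (probability, value) pairs."""
    return sum(p * v for p, v in prob_values)

def var(prob_values):
    """Variance over (probability, value) pairs."""
    mu = mean(prob_values)
    return sum(p * (v - mu) ** 2 for p, v in prob_values)

var_x = var([(p, x) for p, x, _ in scenarios])
var_y = var([(p, y) for p, _, y in scenarios])
var_sum = var([(p, x + y) for p, x, y in scenarios])   # Var(x + y)

# Equation (5.6): covariance from the three variances.
cov_from_variances = (var_sum - var_x - var_y) / 2

# Footnote formula: probability-weighted product of deviations from the means.
e_x = mean([(p, x) for p, x, _ in scenarios])
e_y = mean([(p, y) for p, _, y in scenarios])
cov_direct = sum(p * (x - e_x) * (y - e_y) for p, x, y in scenarios)

print(f"Var(x + y) = {var_sum:.2f}")
print(f"covariance via equation (5.6) = {cov_from_variances:.4f}")
print(f"covariance via deviations     = {cov_direct:.4f}")
```

Both routes give the same covariance, which is a useful sanity check when only one of the two inputs (the joint distribution or Var(x + y)) is at hand.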


A negative covariance between x and y, such as this, means that when x tends to be above 
its mean, y tends to be below its mean and vice versa. 
We can now use equation (5.9) to compute the variance of return for our portfolio. 


Var(.5x + .5y) = .5²(328.1875) + .5²(61.9475) + 2(.5)(.5)(−135.3375) = 29.865


The standard deviation of our portfolio is then given by σ_(.5x + .5y) = √29.865 = 5.4650%.
This is our measure of risk for the portfolio consisting of investing 50% in the stock fund 
and 50% in the bond fund. 
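
One way to build confidence in equations (5.8) and (5.9) is to compare them with a direct calculation. The sketch below, which assumes the scenario data of Table 5.11 and the values computed in the text, first applies the two equations and then builds the probability distribution of r = .5x + .5y scenario by scenario and computes its mean and variance from scratch.

```python
from math import sqrt

# Values computed in the text for the stock fund (x) and bond fund (y).
E_X, E_Y = 9.25, 6.55
VAR_X, VAR_Y = 328.1875, 61.9475
COV_XY = -135.3375

a, b = 0.5, 0.5   # portfolio weights on the stock and bond funds

# Equations (5.8) and (5.9).
e_r = a * E_X + b * E_Y
var_r = a**2 * VAR_X + b**2 * VAR_Y + 2 * a * b * COV_XY

# Cross-check: distribution of r = a*x + b*y built from Table 5.11.
scenarios = [(0.10, -40, 30), (0.25, 5, 5), (0.50, 15, 4), (0.15, 30, 2)]
returns = [(p, a * x + b * y) for p, x, y in scenarios]
e_direct = sum(p * r for p, r in returns)
var_direct = sum(p * (r - e_direct) ** 2 for p, r in returns)

print(f"E(r)   = {e_r:.2f}%  (direct: {e_direct:.2f}%)")
print(f"Var(r) = {var_r:.3f}  (direct: {var_direct:.3f})")
print(f"SD(r)  = {sqrt(var_r):.4f}%")
```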

Perhaps we would now like to compare the three investment alternatives: investing 
solely in the stock fund, investing solely in the bond fund, or creating a portfolio by divid- 
ing our investment amount equally between the stock and bond funds. Table 5.12 shows the 
expected returns, variances, and standard deviations for each of the three alternatives. 

Which of these alternatives would you prefer? The expected return is highest for 
investing 100% in the stock fund, but the risk is also highest. The standard deviation is 
18.1159%. Investing 100% in the bond fund has a lower expected return, but a significantly 
smaller risk. Investing 50% in the stock fund and 50% in the bond fund (the portfolio) has 
an expected return that is halfway between that of the stock fund alone and the bond fund 
alone. But note that it has less risk than investing 100% in either of the individual funds. 
Indeed, it has both a higher return and less risk (smaller standard deviation) than invest- 
ing solely in the bond fund. So we would say that investing in the portfolio dominates the 
choice of investing solely in the bond fund. 

Whether you would choose to invest in the stock fund or the portfolio depends 
on your attitude toward risk. The stock fund has a higher expected return. But the 
portfolio has significantly less risk and also provides a fairly good return. Many 
would choose it. It is the negative covariance between the stock and bond funds that 
has caused the portfolio risk to be so much smaller than the risk of investing solely 
in either of the individual funds. 

The portfolio analysis we just performed was for investing 50% in the stock fund and 
the other 50% in the bond fund. How would you calculate the expected return and the 
variance for other portfolios? Equations (5.8) and (5.9) can be used to make these calcu- 
lations easily. 

Suppose we wish to create a portfolio by investing 25% in the stock fund and 75% in 
the bond fund. What are the expected value and variance of this portfolio? The percent 
return for this portfolio is r = .25x + .75y, so we can use equation (5.8) to get the expected
value of this portfolio: 


E(.25x + .75y) = .25E(x) + .75E(y) = .25(9.25) + .75(6.55) = 7.225 

Likewise, we may calculate the variance of the portfolio using equation (5.9):


Var(.25x + .75y) = (.25)²Var(x) + (.75)²Var(y) + 2(.25)(.75)σ_xy
                 = .0625(328.1875) + (.5625)(61.9475) + (.375)(−135.3375)
                 = 4.6056


The standard deviation of the new portfolio is σ_(.25x + .75y) = √4.6056 = 2.1461%.
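
Since equations (5.8) and (5.9) apply to any pair of weights, the same calculation can be wrapped in a small function and repeated for several portfolios. The following sketch is illustrative only: `portfolio_stats` is a hypothetical helper, the inputs are the values computed in the text, and the set of stock weights tried is an arbitrary choice.

```python
from math import sqrt

def portfolio_stats(a, b, e_x, e_y, var_x, var_y, cov_xy):
    """Expected return and standard deviation of r = a*x + b*y."""
    expected = a * e_x + b * e_y                                  # equation (5.8)
    variance = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy   # equation (5.9)
    return expected, sqrt(variance)

# Values from the text for the stock fund (x) and bond fund (y).
funds = dict(e_x=9.25, e_y=6.55, var_x=328.1875, var_y=61.9475, cov_xy=-135.3375)

for stock_weight in (1.0, 0.75, 0.5, 0.25, 0.0):
    e_r, sd_r = portfolio_stats(stock_weight, 1 - stock_weight, **funds)
    print(f"{stock_weight:.0%} stock: E(r) = {e_r:.3f}%, SD(r) = {sd_r:.4f}%")
```

Running the loop makes the effect of the negative covariance visible: the mixed portfolios have smaller standard deviations than either fund held alone.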


Summary 


We have introduced bivariate discrete probability distributions in this section. Since 
such distributions involve two random variables, we are often interested in a measure of 
association between the variables. The covariance and the correlation coefficient are the 
two measures we introduced and showed how to compute. A correlation coefficient near 1
or −1 indicates a strong correlation between the two random variables, and a correlation
coefficient near zero indicates a weak correlation between the variables. If two random 
variables are independent, the covariance and the correlation coefficient will equal zero. 
We also showed how to compute the expected value and variance of linear combinations 
of random variables. From a statistical point of view, financial portfolios are linear combi- 
nations of random variables. They are actually a special kind of linear combination called a 
weighted average. The coefficients are nonnegative and add to 1. The portfolio example we 
presented showed how to compute the expected value and variance for a portfolio consist- 
ing of an investment in a stock fund and a bond fund. The same methodology can be used 
to compute the expected value and variance of a portfolio consisting of any two financial 
assets. It is the effect of covariance between the individual random variables on the vari- 
ance of the portfolio that is the basis for much of the theory of reducing portfolio risk by 
diversifying across investment alternatives. 


NOTES + COMMENTS 


1. Equations (5.8) and (5.9), along with their extensions to three or more random
variables, are key building blocks in financial portfolio construction and analysis.

2. Equations (5.8) and (5.9) for computing the expected value and variance of a linear
combination of two random variables can be extended to three or more random variables.
The extension of equation (5.8) is straightforward; one more term is added for each
additional random variable. The extension of equation (5.9) is more complicated because
a separate term is needed for the covariance between all pairs of random variables. We
leave these extensions to more advanced texts.

3. The covariance term of equation (5.9) shows why negatively correlated random
variables (investment alternatives) reduce the variance and, hence, the risk of a portfolio.
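
For readers curious about the shape of the extensions mentioned in note 2, the standard results for n random variables can be written as follows. The indexed symbols x_1, ..., x_n and coefficients a_1, ..., a_n are notation introduced here for illustration, not the text's; with n = 2 the expressions reduce to equations (5.8) and (5.9).

```latex
E\left(\sum_{i=1}^{n} a_i x_i\right) = \sum_{i=1}^{n} a_i \, E(x_i)

\mathrm{Var}\left(\sum_{i=1}^{n} a_i x_i\right)
  = \sum_{i=1}^{n} a_i^{2} \, \mathrm{Var}(x_i)
  + 2 \sum_{i<j} a_i a_j \, \sigma_{x_i x_j}
```

The second sum runs over every distinct pair of variables, which is the "separate term for the covariance between all pairs" that the note describes.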


EXERCISES 


Methods 
25. Given below is a bivariate distribution for the random variables x and y. 


f(x, y)    x     y
.2         50    80
.5         30    50
.3         40    60

a. Compute the expected value and the variance for x and y. 
b. Develop a probability distribution for x + y. 
c. Using the result of part (b), compute E(x + y) and Var(x + y). 
d. Compute the covariance and correlation for x and y. Are x and y positively related,
negatively related, or unrelated? 

e. Is the variance of the sum of x and y bigger than, smaller than, or the same as the 
sum of the individual variances? Why? 

26. A person is interested in constructing a portfolio. Two stocks are being considered. Let 
x = percent return for an investment in stock 1, and y = percent return for an invest- 
ment in stock 2. The expected return and variance for stock 1 are E(x) = 8.45% and 
Var(x) = 25. The expected return and variance for stock 2 are E(y) = 3.20% and 
Var(y) = 1. The covariance between the returns is σ_xy = −3.

a. What is the standard deviation for an investment in stock 1 and for an investment in 
stock 2? Using the standard deviation as a measure of risk, which of these stocks is 
the riskier investment? 

b. What is the expected return and standard deviation, in dollars, for a person who 
invests $500 in stock 1? 

c. What is the expected percent return and standard deviation for a person who con- 
structs a portfolio by investing 50% in each stock? 

d. What is the expected percent return and standard deviation for a person who con- 
structs a portfolio by investing 70% in stock 1 and 30% in stock 2? 

e. Compute the correlation coefficient for x and y and comment on the relationship 
between the returns for the two stocks. 


Applications 

27. Canadian Restaurant Ratings. The Chamber of Commerce in a Canadian city has con- 
ducted an evaluation of 300 restaurants in its metropolitan area. Each restaurant received 
a rating on a 3-point scale on typical meal price (1 least expensive to 3 most expensive) 
and quality (1 lowest quality to 3 greatest quality). A crosstabulation of the rating data is 
shown. Forty-two of the restaurants received a rating of 1 on quality and 1 on meal price, 
39 of the restaurants received a rating of 1 on quality and 2 on meal price, and so on. Forty- 
eight of the restaurants received the highest rating of 3 on both quality and meal price. 


Meal Price (y) 


Quality (x) 1 2 3 Total 
1 84
2 150 
3 66 
Total 78 117 105 300 


a. Develop a bivariate probability distribution for quality and meal price of a ran- 
domly selected restaurant in this Canadian city. Let x = quality rating and y = meal 
price. 

b. Compute the expected value and variance for quality rating, x. 

c. Compute the expected value and variance for meal price, y.

d. The Var(x + y) = 1.6691. Compute the covariance of x and y. What can you 
say about the relationship between quality and meal price? Is this what you 
would expect? 

e. Compute the correlation coefficient between quality and meal price. What is the
strength of the relationship? Do you suppose it is likely to find a low-cost restaurant 
in this city that is also high quality? Why or why not? 

28. Printer Manufacturing Costs. PortaCom has developed a design for a high-quality 
portable printer. The two key components of manufacturing cost are direct labor and 
parts. During a testing period, the company has developed prototypes and conducted 
extensive product tests with the new printer. PortaCom’s engineers have developed the
following bivariate probability distribution for the manufacturing costs. Parts cost (in
dollars) per printer is represented by the random variable x and direct labor cost (in 
dollars) per printer is represented by the random variable y. Management would like to 
use this probability distribution to estimate manufacturing costs. 


Direct Labor (y) 


Parts (x) 43 45 48 Total 
85 0.45 

95 0.55 
Total 0.30 0.40 0.30 1.00


a. Show the marginal distribution of direct labor cost and compute its expected value, 
variance, and standard deviation. 

b. Show the marginal distribution of parts cost and compute its expected value, vari- 
ance, and standard deviation. 

c. Total manufacturing cost per unit is the sum of direct labor cost and parts cost. 
Show the probability distribution for total manufacturing cost per unit. 

d. Compute the expected value, variance, and standard deviation of total manufactur- 
ing cost per unit. 

e. Are direct labor and parts costs independent? Why or why not? If you conclude that 
they are not, what is the relationship between direct labor and parts cost? 

f. PortaCom produced 1,500 printers for its product introduction. The total manufac- 
turing cost was $198,350. Is that about what you would expect? If it is higher or 
lower, what do you think may be the reason? 

29. Investment Portfolio of Index Fund and Core Bonds Fund. J.P. Morgan Asset Man- 
agement publishes information about financial investments. Between 2002 and 2011 
the expected return for the S&P 500 was 5.04% with a standard deviation of 19.45% 
and the expected return over that same period for a core bonds fund was 5.78% with a 
standard deviation of 2.13% (J.P. Morgan Asset Management, Guide to the Markets). 
The publication also reported that the correlation between the S&P 500 and core bonds 
is −.32. You are considering portfolio investments that are composed of an S&P 500
index fund and a core bonds fund. 

a. Using the information provided, determine the covariance between the S&P 500 
and core bonds. 

b. Construct a portfolio that is 50% invested in an S&P 500 index fund and 50% in 
a core bonds fund. In percentage terms, what are the expected return and standard 
deviation for such a portfolio? 

c. Construct a portfolio that is 20% invested in an S&P 500 index fund and 80% 
invested in a core bonds fund. In percentage terms, what are the expected return and 
standard deviation for such a portfolio? 

d. Construct a portfolio that is 80% invested in an S&P 500 index fund and 20% 
invested in a core bonds fund. In percentage terms, what are the expected return and 
standard deviation for such a portfolio? 

e. Which of the portfolios in parts (b), (c), and (d) has the largest expected return? 
Which has the smallest standard deviation? Which of these portfolios is the best 
investment alternative? 

f. Discuss the advantages and disadvantages of investing in the three portfolios in 
parts (b), (c), and (d). Would you prefer investing all your money in the S&P 500 
index, the core bonds fund, or one of the three portfolios? Why? 

30. Investment Fund Including REITs. In addition to the information in exercise 
29 on the S&P 500 and core bonds, J.P. Morgan Asset Management reported that the 
expected return for real estate investment trusts (REITs) was 13.07% with a standard 
deviation of 23.17% (J.P. Morgan Asset Management, Guide to the Markets). The correlation
between the S&P 500 and REITs is .74 and the correlation between core bonds
and REITs is −.04. You are considering portfolio investments that are composed of an
S&P 500 index fund and REITs as well as portfolio investments composed of a core
bonds fund and REITs.

a. Using the information provided here and in exercise 29, determine the covariance 
between the S&P 500 and REITs and between core bonds and REITs. 

b. Construct a portfolio that is 50% invested in an S&P 500 fund and 50% invested in 
REITs. In percentage terms, what are the expected return and standard deviation for 
such a portfolio? 

c. Construct a portfolio that is 50% invested in a core bonds fund and 50% invested in 
REITs. In percentage terms, what are the expected return and standard deviation for 
such a portfolio? 

d. Construct a portfolio that is 80% invested in a core bonds fund and 20% invested in 
REITs. In percentage terms, what are the expected return and standard deviation for 
such a portfolio? 

e. Which of the portfolios in parts (b), (c), and (d) would you recommend to an 
aggressive investor? Which would you recommend to a conservative investor? 
Why? 


5.5 Binomial Probability Distribution 


The binomial probability distribution is a discrete probability distribution that has many 
applications. It is associated with a multiple-step experiment that we call the binomial 
experiment. 


A Binomial Experiment 


A binomial experiment exhibits the following four properties. 


PROPERTIES OF A BINOMIAL EXPERIMENT 


1. The experiment consists of a sequence of n identical trials. 

2. Two outcomes are possible on each trial. We refer to one outcome as a success 
and the other outcome as a failure. 

3. The probability of a success, denoted by p, does not change from trial to trial. 
Consequently, the probability of a failure, denoted by 1 − p, does not change
from trial to trial. 

4. The trials are independent. 


If properties 2, 3, and 4 are present, we say the trials are generated by a Bernoulli process.
If, in addition, property 1 is present, we say we have a binomial experiment. Figure 5.3
depicts one possible sequence of successes and failures for a binomial experiment
involving eight trials.

Jakob Bernoulli (1654-1705), the first of the Bernoulli family of Swiss mathematicians,
published a treatise on probability that contained the theory of permutations and
combinations, as well as the binomial theorem.

In a binomial experiment, our interest is in the number of successes occurring in the
n trials. If we let x denote the number of successes occurring in the n trials, we see that x
can assume the values of 0, 1, 2, 3, . . . , n. Because the number of values is finite, x is a
discrete random variable. The probability distribution associated with this random variable
is called the binomial probability distribution. For example, consider the experiment of


tossing a coin five times and on each toss observing whether the coin lands with a head or 
a tail on its upward face. Suppose we want to count the number of heads appearing over the 
five tosses. Does this experiment show the properties of a binomial experiment? What is 
the random variable of interest? Note that 


1. The experiment consists of five identical trials; each trial involves the tossing of 
one coin. 
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FIGURE 5.3 One Possible Sequence of Successes and Failures 


for an Eight-Trial Binomial Experiment 


Property 1: The experiment consists of 
n = 8 identical trials. 


Property 2: Each trial results in either 
success (S) or failure (F). 


Trials ————— 1 2 3 4 5 6 7 8 


Outcomes ————~> S F F S S FS S 


2. Two outcomes are possible for each trial: a head or a tail. We can designate head a 
success and tail a failure. 

3. The probability of a head and the probability of a tail are the same for each trial,
with p = .5 and 1 − p = .5.

4. The trials or tosses are independent because the outcome on any one trial is not 
affected by what happens on other trials or tosses. 


Thus, the properties of a binomial experiment are satisfied. The random variable of interest 
is x = the number of heads appearing in the five trials. In this case, x can assume the values 
of 0, 1, 2, 3, 4, or 5. 

As another example, consider an insurance salesperson who visits 10 randomly selected 
families. The outcome associated with each visit is classified as a success if the family 
purchases an insurance policy and a failure if the family does not. From past experience, 
the salesperson knows the probability that a randomly selected family will purchase an 
insurance policy is .10. Checking the properties of a binomial experiment, we observe that 


1. The experiment consists of 10 identical trials; each trial involves contacting one 
family. 

2. Two outcomes are possible on each trial: the family purchases a policy (success) or 
the family does not purchase a policy (failure). 

3. The probabilities of a purchase and a nonpurchase are assumed to be the same for
each sales call, with p = .10 and 1 − p = .90.

4. The trials are independent because the families are randomly selected. 


Because the four assumptions are satisfied, this example is a binomial experiment. The 
random variable of interest is the number of sales obtained in contacting the 10 families. In 
this case, x can assume the values of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. 

Property 3 of the binomial experiment is called the stationarity assumption and is 
sometimes confused with property 4, independence of trials. To see how they differ, 
consider again the case of the salesperson calling on families to sell insurance poli- 
cies. If, as the day wore on, the salesperson got tired and lost enthusiasm, the proba- 
bility of success (selling a policy) might drop to .05, for example, by the 10th call. In 
such a case, property 3 (stationarity) would not be satisfied, and we would not have a 
binomial experiment. Even if property 4 held—that is, the purchase decisions of each 
family were made independently—it would not be a binomial experiment if property 3 
was not satisfied. 

In applications involving binomial experiments, a special mathematical formula, called 
the binomial probability function, can be used to compute the probability of x successes 
in the n trials. Using probability concepts introduced in Chapter 4, we will show in the 
context of an illustrative problem how the formula can be developed. 
Martin Clothing Store Problem


Let us consider the purchase decisions of the next three customers who enter the Martin 
Clothing Store. On the basis of past experience, the store manager estimates the probability 
that any one customer will make a purchase is .30. What is the probability that two of the 
next three customers will make a purchase? 

Using a tree diagram (Figure 5.4), we can see that the experiment of observing the three 
customers each making a purchase decision has eight possible outcomes. Using S to denote 
success (a purchase) and F to denote failure (no purchase), we are interested in experi- 
mental outcomes involving two successes in the three trials (purchase decisions). Next, let 
us verify that the experiment involving the sequence of three purchase decisions can be 
viewed as a binomial experiment. Checking the four requirements for a binomial experi- 
ment, we note that 


1. The experiment can be described as a sequence of three identical trials, one trial for 
each of the three customers who will enter the store. 

2. Two outcomes—the customer makes a purchase (success) or the customer does not 
make a purchase (failure)—are possible for each trial. 

3. The probability that the customer will make a purchase (.30) or will not make a 
purchase (.70) is assumed to be the same for all customers. 

4. The purchase decision of each customer is independent of the decisions of the other 
customers. 


Hence, the properties of a binomial experiment are present. 


FIGURE 5.4 Tree Diagram for the Martin Clothing Store Problem

First       Second      Third       Experimental
Customer    Customer    Customer    Outcome        Value of x

S           S           S           (S, S, S)      3
S           S           F           (S, S, F)      2
S           F           S           (S, F, S)      2
S           F           F           (S, F, F)      1
F           S           S           (F, S, S)      2
F           S           F           (F, S, F)      1
F           F           S           (F, F, S)      1
F           F           F           (F, F, F)      0

S = Purchase
F = No purchase
x = Number of customers making a purchase
The number of experimental outcomes resulting in exactly x successes in n trials can be
computed using the following formula.*


NUMBER OF EXPERIMENTAL OUTCOMES PROVIDING EXACTLY x SUCCESSES IN n TRIALS

$$\binom{n}{x} = \frac{n!}{x!(n - x)!} \qquad (5.10)$$

where

$$n! = n(n - 1)(n - 2) \cdots (2)(1)$$

and, by definition,

$$0! = 1$$

Now let us return to the Martin Clothing Store experiment involving three customer pur- 
chase decisions. Equation (5.10) can be used to determine the number of experimental out- 
comes involving two purchases; that is, the number of ways of obtaining x = 2 successes in 
the n = 3 trials. From equation (5.10) we have 


$$\binom{n}{x} = \binom{3}{2} = \frac{3!}{2!(3 - 2)!} = \frac{(3)(2)(1)}{(2)(1)(1)} = \frac{6}{2} = 3$$

Equation (5.10) shows that three of the experimental outcomes yield two successes. From
Figure 5.4 we see that these three outcomes are denoted by (S, S, F), (S, F, S), and (F, S, S).


Using equation (5.10) to determine how many experimental outcomes have three successes
(purchases) in the three trials, we obtain

$$\binom{n}{x} = \binom{3}{3} = \frac{3!}{3!(3 - 3)!} = \frac{3!}{3!0!} = \frac{(3)(2)(1)}{(3)(2)(1)(1)} = \frac{6}{6} = 1$$

From Figure 5.4 we see that the one experimental outcome with three successes is 
identified by (S, S, S). 
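
As an informal check outside the Excel workflow used in this text, the counts given by equation (5.10) can be compared with a brute-force enumeration of every sequence of purchase decisions. The following Python sketch is only an illustration; the function name count_outcomes is our own, and math.comb is the standard-library routine for the number of combinations.

```python
from itertools import product
from math import comb

def count_outcomes(n, x):
    """Count the sequences of n trials (S or F) that contain exactly x successes."""
    return sum(1 for seq in product("SF", repeat=n) if seq.count("S") == x)

n = 3  # three customers, as in the Martin Clothing Store problem
for x in range(n + 1):
    # comb(n, x) evaluates n! / (x! (n - x)!), the right-hand side of equation (5.10)
    print(x, count_outcomes(n, x), comb(n, x))
```

For n = 3 the two columns agree for every x, with three outcomes for x = 2 and one outcome for x = 3, matching the tree diagram.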

We know that equation (5.10) can be used to determine the number of experimental 
outcomes that result in x successes in n trials. If we are to determine the probability of x 
successes in n trials, however, we must also know the probability associated with each of
these experimental outcomes. Because the trials of a binomial experiment are independent, 
we can simply multiply the probabilities associated with each trial outcome to find the 
probability of a particular sequence of successes and failures. 

The probability of purchases by the first two customers and no purchase by the third 
customer, denoted (S, S, F), is given by 


$$pp(1 - p)$$


With a .30 probability of a purchase on any one trial, the probability of a purchase on the 
first two trials and no purchase on the third is given by 


$$(.30)(.30)(.70) = (.30)^2(.70) = .063$$


*This formula, introduced in Chapter 4, determines the number of combinations of n objects selected x at a time. 
For the binomial experiment, this combinatorial formula provides the number of experimental outcomes (sequences 
of n trials) resulting in x successes. 
Two other experimental outcomes also result in two successes and one failure. The
probabilities for all three experimental outcomes involving two successes follow.


Trial Outcomes

1st          2nd          3rd          Experimental   Probability of
Customer     Customer     Customer     Outcome        Experimental Outcome

Purchase     Purchase     No purchase  (S, S, F)      $pp(1 - p) = p^2(1 - p) = (.30)^2(.70) = .063$
Purchase     No purchase  Purchase     (S, F, S)      $p(1 - p)p = p^2(1 - p) = (.30)^2(.70) = .063$
No purchase  Purchase     Purchase     (F, S, S)      $(1 - p)pp = p^2(1 - p) = (.30)^2(.70) = .063$


Observe that all three experimental outcomes with two successes have exactly the 
same probability. This observation holds in general. In any binomial experiment, all 
sequences of trial outcomes yielding x successes in n trials have the same probability 
of occurrence. The probability of each sequence of trials yielding x successes in n trials 
follows. 


$$\text{Probability of a particular sequence of trial outcomes with } x \text{ successes in } n \text{ trials} = p^x(1 - p)^{(n - x)} \qquad (5.11)$$


For the Martin Clothing Store, this formula shows that any experimental outcome with two
successes has a probability of $p^2(1 - p)^{(3 - 2)} = p^2(1 - p)^1 = (.30)^2(.70)^1 = .063$.
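
The claim that every sequence with the same number of successes has the same probability can be checked numerically. The Python sketch below is an illustration only, with function and variable names of our own choosing; it multiplies the trial probabilities for each sequence containing two purchases, relying on the independence of the trials.

```python
from itertools import product

def sequence_probability(seq, p):
    """Multiply the per-trial probabilities: p for a success (S), 1 - p for a failure (F)."""
    prob = 1.0
    for outcome in seq:
        prob *= p if outcome == "S" else (1 - p)
    return prob

p = 0.30  # probability of a purchase on any one trial
two_successes = [seq for seq in product("SF", repeat=3) if seq.count("S") == 2]
for seq in two_successes:
    # Each of the three sequences yields (.30)^2 (.70) = .063
    print("".join(seq), round(sequence_probability(seq, p), 3))
```

The order of the successes and failures within a sequence does not matter, because multiplication is commutative.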

Because equation (5.10) shows the number of outcomes in a binomial experiment 
with x successes and equation (5.11) gives the probability for each sequence involving 
x successes, we combine equations (5.10) and (5.11) to obtain the following binomial 
probability function. 


BINOMIAL PROBABILITY FUNCTION

$$f(x) = \binom{n}{x} p^x (1 - p)^{(n - x)} \qquad (5.12)$$

where

x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials

$$\binom{n}{x} = \frac{n!}{x!(n - x)!}$$


For the binomial probability distribution, x is a discrete random variable with the
probability function f(x) applicable for values of x = 0, 1, 2, . . . , n.

In the Martin Clothing Store example, let us use equation (5.12) to compute the probability
that no customer makes a purchase, exactly one customer makes a purchase, exactly
two customers make a purchase, and all three customers make a purchase. The calculations
are summarized in Table 5.13, which gives the probability distribution of the number of
customers making a purchase. Figure 5.5 is a graph of this probability distribution.


TABLE 5.13 Probability Distribution for the Number of Customers Making a Purchase

x        f(x)

0        $\frac{3!}{0!3!}(.30)^0(.70)^3 = .343$
1        $\frac{3!}{1!2!}(.30)^1(.70)^2 = .441$
2        $\frac{3!}{2!1!}(.30)^2(.70)^1 = .189$
3        $\frac{3!}{3!0!}(.30)^3(.70)^0 = .027$

Total    1.000

The binomial probability function can be applied to any binomial experiment. If we 
are satisfied that a situation demonstrates the properties of a binomial experiment and if 
we know the values of n and p, we can use equation (5.12) to compute the probability of x 
successes in the n trials. 

If we consider variations of the Martin experiment, such as 10 customers rather than 
3 entering the store, the binomial probability function given by equation (5.12) is still 
applicable. Suppose we have a binomial experiment with n = 10, x = 4, and p = .30. The 
probability of making exactly four sales to 10 customers entering the store is 


$$f(4) = \frac{10!}{4!6!}(.30)^4(.70)^6 = .2001$$
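
For readers who want to verify these values outside Excel, the binomial probability function can be written directly from equation (5.12). The Python sketch below is a minimal illustration; the function name binomial_pmf is our own and is not part of the text or of any Excel feature.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Equation (5.12): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Three customers, p = .30: reproduces .343, .441, .189, .027
for x in range(4):
    print(x, f"{binomial_pmf(x, 3, 0.30):.3f}")

# Ten customers, exactly four purchases: .2001
print(f"{binomial_pmf(4, 10, 0.30):.4f}")
```

The probabilities over all possible values of x sum to 1, which is a useful sanity check on any such calculation.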


FIGURE 5.5 Graphical Representation of the Probability Distribution for the
Number of Customers Making a Purchase

[Bar chart. Vertical axis: Probability. Horizontal axis: Number of Customers Making a Purchase (0, 1, 2, 3).]




Using Excel to Compute Binomial Probabilities 


For many probability functions that can be specified as formulas, Excel provides functions
for computing probabilities and cumulative probabilities. In this section, we show how
Excel’s BINOM.DIST function can be used to compute binomial probabilities and cumulative
binomial probabilities. We begin by showing how to compute the binomial probabilities
for the Martin Clothing Store example shown in Table 5.13. Refer to Figure 5.6
as we describe the tasks involved. The formula worksheet is in the background; the value
worksheet is in the foreground.


Enter/Access Data: In order to compute a binomial probability we must know the number
of trials (n), the probability of success (p), and the value of the random variable (x). For
the Martin Clothing Store example, the number of trials is 3; this value has been entered
in cell D1. The probability of success is .3; this value has been entered in cell D2. Because
we want to compute the probability for x = 0, 1, 2, and 3, these values were entered in
cells B5:B8.


Enter Functions and Formulas: The BINOM.DIST function has four inputs: The first is the
value of x, the second is the value of n, the third is the value of p, and the fourth is FALSE
or TRUE. We choose FALSE for the fourth input if a probability is desired and TRUE if
a cumulative probability is desired. The formula =BINOM.DIST(B5,$D$1,$D$2,FALSE)
has been entered in cell C5 to compute the probability of 0 successes in 3 trials. Note in the
value worksheet that the probability computed for f(0), .343, is the same as that shown in
Table 5.13. The formula in cell C5 is copied to cells C6:C8 to compute the probabilities for
x = 1, 2, and 3 successes, respectively.


We can also compute cumulative probabilities using Excel’s BINOM.DIST function. 
To illustrate, let us consider the case of 10 customers entering the Martin Clothing Store 
and compute the probabilities and cumulative probabilities for the number of customers 
making a purchase. Recall that the cumulative probability for x = 1 is the probability of 
1 or fewer purchases, the cumulative probability for x = 2 is the probability of 2 or fewer 
purchases, and so on. So, the cumulative probability for x = 10 is 1. Refer to Figure 5.7 
as we describe the tasks involved in computing these cumulative probabilities. The for- 
mula worksheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: We entered the number of trials (10) in cell D1, the probability of
success (.3) in cell D2, and the values for the random variable in cells B5:B15.


FIGURE 5.6 Excel Worksheet for Computing Binomial Probabilities of Number of Customers
Making a Purchase

Formula worksheet (background):

Row   B    C                                  D
1          Number of Trials (n)               3
2          Probability of Success (p)         0.3
3
4     x    f(x)
5     0    =BINOM.DIST(B5,$D$1,$D$2,FALSE)
6     1    =BINOM.DIST(B6,$D$1,$D$2,FALSE)
7     2    =BINOM.DIST(B7,$D$1,$D$2,FALSE)
8     3    =BINOM.DIST(B8,$D$1,$D$2,FALSE)

Value worksheet (foreground):

Row   B    C                                  D
1          Number of Trials (n)               3
2          Probability of Success (p)         0.3
3
4     x    f(x)
5     0    0.343
6     1    0.441
7     2    0.189
8     3    0.027
FIGURE 5.7 Excel Worksheet for Computing Probabilities and Cumulative Probabilities for
Number of Purchases with 10 Customers

Formula worksheet (background):

Row   B    C                                   D
1          Number of Trials (n)                10
2          Probability of Success (p)          0.3
3
4     x    f(x)                                Cum Prob
5     0    =BINOM.DIST(B5,$D$1,$D$2,FALSE)     =BINOM.DIST(B5,$D$1,$D$2,TRUE)
6     1    =BINOM.DIST(B6,$D$1,$D$2,FALSE)     =BINOM.DIST(B6,$D$1,$D$2,TRUE)
7     2    =BINOM.DIST(B7,$D$1,$D$2,FALSE)     =BINOM.DIST(B7,$D$1,$D$2,TRUE)
8     3    =BINOM.DIST(B8,$D$1,$D$2,FALSE)     =BINOM.DIST(B8,$D$1,$D$2,TRUE)
9     4    =BINOM.DIST(B9,$D$1,$D$2,FALSE)     =BINOM.DIST(B9,$D$1,$D$2,TRUE)
10    5    =BINOM.DIST(B10,$D$1,$D$2,FALSE)    =BINOM.DIST(B10,$D$1,$D$2,TRUE)
11    6    =BINOM.DIST(B11,$D$1,$D$2,FALSE)    =BINOM.DIST(B11,$D$1,$D$2,TRUE)
12    7    =BINOM.DIST(B12,$D$1,$D$2,FALSE)    =BINOM.DIST(B12,$D$1,$D$2,TRUE)
13    8    =BINOM.DIST(B13,$D$1,$D$2,FALSE)    =BINOM.DIST(B13,$D$1,$D$2,TRUE)
14    9    =BINOM.DIST(B14,$D$1,$D$2,FALSE)    =BINOM.DIST(B14,$D$1,$D$2,TRUE)
15    10   =BINOM.DIST(B15,$D$1,$D$2,FALSE)    =BINOM.DIST(B15,$D$1,$D$2,TRUE)

Value worksheet (foreground):

Row   B    C                                   D
1          Number of Trials (n)                10
2          Probability of Success (p)          0.3
3
4     x    f(x)                                Cum Prob
5     0    0.0282                              0.0282
6     1    0.1211                              0.1493
7     2    0.2335                              0.3828
8     3    0.2668                              0.6496
9     4    0.2001                              0.8497
10    5    0.1029                              0.9527
11    6    0.0368                              0.9894
12    7    0.0090                              0.9984
13    8    0.0014                              0.9999
14    9    0.0001                              1.0000
15    10   0.0000                              1.0000


Enter Functions and Formulas: The binomial probabilities for each value of the ran- 
dom variable are computed in column C and the cumulative probabilities are computed in 
column D. We entered the formula =BINOM.DIST(B5,$D$1,$D$2, FALSE) in cell C5 to 
compute the probability of 0 successes in 10 trials. Note that we used FALSE as the fourth 
input in the BINOM.DIST function. The probability (.0282) is shown in cell C5 of the 
value worksheet. The formula in cell C5 is simply copied to cells C6:C15 to compute the 
remaining probabilities. 


To compute the cumulative probabilities we start by entering the formula =BINOM.DIST 
(B5,$D$1,$D$2, TRUE) in cell D5. Note that we used TRUE as the fourth input in the 
BINOM.DIST function. The formula in cell D5 is then copied to cells D6:D15 to compute 
the remaining cumulative probabilities. In cell D5 of the value worksheet we see that the 
cumulative probability for x = 0 is the same as the probability for x = 0. Each of the re- 
maining cumulative probabilities is the sum of the previous cumulative probability and the 
individual probability in column C. For instance, the cumulative probability for x = 4 is 
given by .6496 + .2001 = .8497. Note also that the cumulative probability for x = 10 is 1. 
The cumulative probability of x = 9 is also 1 because the probability of x = 10 is zero (to 
four decimal places of accuracy). 
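
The relationship between the probabilities and the cumulative probabilities can also be sketched in Python. The function below is a hypothetical stand-in that borrows the argument order of Excel's BINOM.DIST (x, n, p, cumulative); it is not Excel's implementation, only an illustration of the same arithmetic.

```python
from math import comb

def binom_dist(x, n, p, cumulative):
    """Binomial probability of x successes, or of x or fewer successes if cumulative is True."""
    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    if cumulative:
        # Cumulative probability: sum of f(0), f(1), ..., f(x)
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

n, p = 10, 0.3  # ten customers, probability of a purchase .3
for x in range(n + 1):
    print(x, f"{binom_dist(x, n, p, False):.4f}", f"{binom_dist(x, n, p, True):.4f}")
```

Each cumulative value equals the previous cumulative value plus the current probability, so the final cumulative probability is 1.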


Expected Value and Variance 
for the Binomial Distribution 


In Section 5.3 we provided formulas for computing the expected value and variance
of a discrete random variable. In the special case where the random variable has a binomial
distribution with a known number of trials n and a known probability of success p,
the general formulas for the expected value and variance can be simplified. The
results follow.


EXPECTED VALUE AND VARIANCE FOR THE BINOMIAL DISTRIBUTION

$$E(x) = \mu = np \qquad (5.13)$$

$$\mathrm{Var}(x) = \sigma^2 = np(1 - p) \qquad (5.14)$$
For the Martin Clothing Store problem with three customers, we can use equation (5.13)
to compute the expected number of customers who will make a purchase.


$$E(x) = np = 3(.30) = .9$$


Suppose that for the next month the Martin Clothing Store forecasts 1,000 customers will
enter the store. What is the expected number of customers who will make a purchase? The
answer is μ = np = (1,000)(.3) = 300. Thus, to increase the expected number of purchases,
Martin Clothing Store must induce more customers to enter the store and/or somehow
increase the probability that any individual customer will make a purchase after entering.

For the Martin Clothing Store problem with three customers, we see that the variance 
and standard deviation for the number of customers who will make a purchase are 


$$\sigma^2 = np(1 - p) = 3(.3)(.7) = .63$$
$$\sigma = \sqrt{.63} = .79$$


For the next 1,000 customers entering the store, the variance and standard deviation for 
the number of customers who will make a purchase are 


$$\sigma^2 = np(1 - p) = 1,000(.3)(.7) = 210$$
$$\sigma = \sqrt{210} = 14.49$$
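
The shortcut formulas can be checked against the general definitions of expected value and variance for a discrete random variable. The Python sketch below is illustrative only, and its function names are our own; it computes the mean and variance of the three-customer distribution by summing over all values of x, then applies the shortcut formulas to the 1,000-customer case.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def mean_and_variance(n, p):
    """Apply the general definitions: sum of x f(x) and sum of (x - mean)^2 f(x)."""
    mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
    variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, variance

mean, variance = mean_and_variance(3, 0.3)
print(round(mean, 2), round(variance, 2), round(sqrt(variance), 2))  # 0.9 0.63 0.79

# Shortcut formulas np and np(1 - p) for 1,000 customers
n, p = 1000, 0.3
print(n * p, n * p * (1 - p), round(sqrt(n * p * (1 - p)), 2))
```

The general definitions and the shortcut formulas give the same mean and variance; the shortcut simply avoids summing over every value of x.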


EXERCISES 


Methods 

31. Consider a binomial experiment with two trials and p = .4. 
a. Draw a tree diagram for this experiment (see Figure 5.4).
b. Compute the probability of one success, f (1). 
c. Compute f (0). 
d. Compute f (2). 
e. Compute the probability of at least one success. 
f. Compute the expected value, variance, and standard deviation. 

32. Consider a binomial experiment with n = 10 and p = .10.
a. Compute f(0).
b. Compute f(2).
c. Compute P(x ≤ 2).
d. Compute P(x ≥ 1).
e. Compute E(x).
f. Compute Var(x) and σ.

33. Consider a binomial experiment with n = 20 and p = .70.
a. Compute f(12).
b. Compute f(16).
c. Compute P(x ≥ 16).
d. Compute P(x ≤ 15).
e. Compute E(x).
f. Compute Var(x) and σ.


Applications 
34. How Teenagers Listen to Music. For its Music 360 survey, Nielsen Co. asked teenagers 
how they listened to music in the past 12 months. Nearly two-thirds of U.S. teenagers 
under the age of 18 say they use YouTube to listen to music and 35% of the teenagers 
said they use Pandora’s custom online radio service (The Wall Street Journal). Suppose 
10 teenagers are selected randomly to be interviewed about how they listen to music. 
a. Is randomly selecting 10 teenagers and asking whether or not they use Pandora’s 
online service a binomial experiment? 
b. What is the probability that none of the 10 teenagers uses Pandora’s online radio 
service? 
c. What is the probability that 4 of the 10 teenagers use Pandora’s online radio
service?

d. What is the probability that at least 2 of the 10 teenagers use Pandora’s online radio 
service? 
35. Appeals for Medicare Service. The Center for Medicare and Medical Services reported 
that there were 295,000 appeals for hospitalization and other Part A Medicare service. 
For this group, 40% of first-round appeals were successful (The Wall Street Journal). 
Suppose 10 first-round appeals have just been received by a Medicare appeals office. 
a. Compute the probability that none of the appeals will be successful. 
b. Compute the probability that exactly one of the appeals will be successful. 
c. What is the probability that at least two of the appeals will be successful? 
d. What is the probability that more than half of the appeals will be successful? 
36. Number of Defective Parts. When a new machine is functioning properly, only 3% of 
the items produced are defective. Assume that we will randomly select two parts pro- 
duced on the machine and that we are interested in the number of defective parts found. 
a. Describe the conditions under which this situation would be a binomial experiment. 
b. Draw a tree diagram similar to Figure 5.4 showing this problem as a two-trial experi- 
ment. 

c. How many experimental outcomes result in exactly one defect being found? 

d. Compute the probabilities associated with finding no defects, exactly one defect, 
and two defects. 

37. Americans Saving for Retirement. According to a 2018 survey by Bankrate.com, 
20% of adults in the United States save nothing for retirement (CNBC website). Sup- 
pose that 15 adults in the United States are selected randomly. 

a. Is the selection of the 15 adults a binomial experiment? Explain. 

b. What is the probability that all of the selected adults save nothing for retirement? 

c. What is the probability that exactly five of the selected adults save nothing for retirement? 

d. What is the probability that at least one of the selected adults saves nothing for 
retirement? 

38. Detecting Missile Attacks. Military radar and missile detection systems are designed 
to warn a country of an enemy attack. A reliability question is whether a detection 
system will be able to identify an attack and issue a warning. Assume that a particular 
detection system has a .90 probability of detecting a missile attack. Use the binomial 
probability distribution to answer the following questions. 

a. What is the probability that a single detection system will detect an attack? 

b. If two detection systems are installed in the same area and operate independently, 
what is the probability that at least one of the systems will detect the attack? 

c. If three systems are installed, what is the probability that at least one of the systems 
will detect the attack? 

d. Would you recommend that multiple detection systems be used? Explain. 

39. Web Browser Market Share. Market-share-analysis company Net Applications 
monitors and reports on Internet browser usage. According to Net Applications, in the 
summer of 2014, Google’s Chrome browser exceeded a 20% market share for the first 
time, with a 20.37% share of the browser market (Forbes website). For a randomly 
selected group of 20 Internet browser users, answer the following questions. 

a. Compute the probability that exactly 8 of the 20 Internet browser users use Chrome 
as their Internet browser. 

b. Compute the probability that at least 3 of the 20 Internet browser users use Chrome 
as their Internet browser. 

c. For the sample of 20 Internet browser users, compute the expected number of 
Chrome users. 

d. For the sample of 20 Internet browser users, compute the variance and standard 
deviation for the number of Chrome users. 

40. Contributing to Household Income. A study conducted by the Pew Research Center 
showed that 75% of 18- to 34-year-olds living with their parents say they contribute 
to household expenses (The Wall Street Journal). Suppose that a random sample of
fifteen 18- to 34-year-olds living with their parents is selected and asked if they
contribute to household expenses.

a. Is the selection of the fifteen 18- to 34-year-olds living with their parents a binomial 
experiment? Explain. 

b. If the sample shows that none of the fifteen 18- to 34-year-olds living with their 
parents contributes to household expenses, would you question the results of the Pew 
Research study? Explain. 

c. What is the probability that at least 10 of the fifteen 18- to 34-year-olds living with 
their parents contribute to household expenses? 

41. Introductory Statistics Course Withdrawals. A university found that 20% of its 
students withdraw without completing the introductory statistics course. Assume that 
20 students registered for the course. 

a. Compute the probability that 2 or fewer will withdraw. 

b. Compute the probability that exactly 4 will withdraw. 

c. Compute the probability that more than 3 will withdraw. 

d. Compute the expected number of withdrawals. 

42. State of the Nation Survey. A Gallup Poll showed that 30% of Americans are satis- 
fied with the way things are going in the United States (Gallup website). Suppose a 
sample of 20 Americans is selected as part of a study of the state of the nation. The 
Americans in the sample are asked whether or not they are satisfied with the way 
things are going in the United States. 

a. Compute the probability that exactly 4 of the 20 Americans surveyed are satisfied 
with the way things are going in the United States. 

b. Compute the probability that at least 2 of the Americans surveyed are satisfied with 
the way things are going in the United States. 

c. For the sample of 20 Americans, compute the expected number of Americans who 
are satisfied with the way things are going in the United States. 

d. For the sample of 20 Americans, compute the variance and standard deviation of 
the number of Americans who are satisfied with the way things are going in the 
United States. 

43. Tracked E-mails. According to a 2017 Wired magazine article, 40% of e-mails that 
are received are tracked using software that can tell the e-mail sender when, where, 
and on what type of device the e-mail was opened (Wired magazine website). Suppose 
we randomly select 50 received e-mails. 

a. What is the expected number of these e-mails that are tracked? 

b. What are the variance and standard deviation for the number of these e-mails that 
are tracked? 


5.6 Poisson Probability Distribution 


The Poisson probability distribution is often used to model random arrivals in
waiting line situations.

In this section, we consider a discrete random variable that is often useful in estimating
the number of occurrences over a specified interval of time or space. For example, the
random variable of interest might be the number of arrivals to a hospital emergency room,
the number of repairs needed in 10 miles of highway, or the number of leaks in 100 miles
of pipeline. If the following two properties are satisfied, the number of occurrences is a
random variable described by the Poisson probability distribution.


PROPERTIES OF A POISSON EXPERIMENT 


1. The probability of an occurrence is the same for any two intervals of equal length. 
2. The occurrence or nonoccurrence in any interval is independent of the occur- 
rence or nonoccurrence in any other interval. 
The Poisson probability function is defined by equation (5.15).


POISSON PROBABILITY FUNCTION

$$f(x) = \frac{\mu^x e^{-\mu}}{x!} \qquad (5.15)$$

where

f(x) = the probability of x occurrences in an interval
μ = expected value or mean number of occurrences in an interval
e = 2.71828


Siméon Poisson taught mathematics at the École Polytechnique in Paris from
1802 to 1808. In 1837, he published a work entitled “Researches on the Probability
of Criminal and Civil Verdicts,” which includes a discussion of what later became
known as the Poisson distribution.


For the Poisson probability distribution, x is a discrete random variable indicating the 
number of occurrences in the interval. Since there is no stated upper limit for the number 
of occurrences, the probability function f(x) is applicable for values x = 0, 1, 2,... without 
limit. In practical applications, x will eventually become large enough so that f(x) is ap- 
proximately zero and the probability of any larger values of x becomes negligible. 


An Example Involving Time Intervals 


Suppose that we are interested in the number of patients who arrive at an emergency room of 
a large hospital during a 15-minute period on weekday mornings. If we can assume that the 
probability of a patient arriving is the same for any two time periods of equal length and that 
the arrival or nonarrival of a patient in any time period is independent of the arrival or nonar- 
rival in any other time period, the Poisson probability function is applicable. Suppose these 
assumptions are satisfied and an analysis of historical data shows that the average number of 
patients arriving in a 15-minute period of time is 10; in this case, the following probability 
function applies. 


$$f(x) = \frac{10^x e^{-10}}{x!}$$


The random variable here is x = number of patients arriving in any 15-minute period. 
If management wanted to know the probability of exactly five arrivals in 15 minutes, we 
would set x = 5 and thus obtain 


$$\text{Probability of exactly 5 arrivals in 15 minutes} = f(5) = \frac{10^5 e^{-10}}{5!} = .0378$$


The probability of five arrivals in 15 minutes was obtained by using a calculator to 
evaluate the probability function. Excel also provides a function called POISSON.DIST for 
computing Poisson probabilities and cumulative probabilities. This function is easier to use 
when numerous probabilities and cumulative probabilities are desired. At the end of this 
section, we show how to compute these probabilities with Excel. 
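
As a complement to the calculator and Excel approaches, the Poisson probability function can be evaluated in a few lines of Python. This is a minimal sketch with a function name of our own choosing (poisson_pmf); it uses only the standard-library exp and factorial functions.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Equation (5.15): probability of x occurrences when the mean number is mu."""
    return mu**x * exp(-mu) / factorial(x)

# Mean of 10 patient arrivals per 15-minute period; probability of exactly 5 arrivals
print(f"{poisson_pmf(5, 10):.4f}")  # 0.0378

# For large x the probabilities become negligible, as noted in the text
print(poisson_pmf(40, 10))
```

Because x has no stated upper limit, a cumulative probability is obtained by summing poisson_pmf over 0, 1, . . . up to the value of interest.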

A property of the Poisson distribution is that the mean and variance are equal.

In the preceding example, the mean of the Poisson distribution is μ = 10 arrivals per
15-minute period. A property of the Poisson distribution is that the mean of the distribution
and the variance of the distribution are equal. Thus, the variance for the number of arrivals
during 15-minute periods is σ² = 10. The standard deviation is σ = √10 = 3.16.

Our illustration involves a 15-minute period, but other time periods can be used. Suppose
we want to compute the probability of one arrival in a 3-minute period. Because 10 is the
expected number of arrivals in a 15-minute period, we see that 10/15 = 2/3 is the expected
number of arrivals in a 1-minute period and that (2/3)(3 minutes) = 2 is the expected number
of arrivals in a 3-minute period. Thus, the probability of x arrivals in a 3-minute time
period with μ = 2 is given by the following Poisson probability function.


$$f(x) = \frac{2^x e^{-2}}{x!}$$
The probability of one arrival in a 3-minute period is calculated as follows:


$$\text{Probability of exactly 1 arrival in 3 minutes} = f(1) = \frac{2^1 e^{-2}}{1!} = .2707$$

One might expect that because (5 arrivals)/5 = 1 arrival and (15 minutes)/5 =
3 minutes, the probability of one arrival during a 3-minute period and the probability
of five arrivals during a 15-minute period would be identical. Earlier we computed
the probability of five arrivals in a 15-minute period to be .0378. However,
note that the probability of one arrival in a 3-minute period (.2707) is not the same.
When computing a Poisson probability for a different time interval, we must first
convert the mean arrival rate to the time period of interest and then compute the
probability.
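
The two-step procedure described above, first converting the mean to the interval of interest and then evaluating the probability function, can be sketched in Python as follows. The helper names scaled_mean and poisson_pmf are our own illustrations and are not part of the text.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of x occurrences when the mean number of occurrences is mu."""
    return mu**x * exp(-mu) / factorial(x)

def scaled_mean(mean, interval_length, new_length):
    """Convert a mean stated for one interval length to another interval length."""
    return mean * new_length / interval_length

mu_15 = 10                         # mean arrivals per 15-minute period
mu_3 = scaled_mean(mu_15, 15, 3)   # mean arrivals per 3-minute period: 2.0

print(f"{poisson_pmf(5, mu_15):.4f}")  # five arrivals in 15 minutes: 0.0378
print(f"{poisson_pmf(1, mu_3):.4f}")   # one arrival in 3 minutes: 0.2707
```

The two probabilities differ even though the arrival rate per minute is the same, which is why the mean must be rescaled before the probability is computed.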


An Example Involving Length or Distance Intervals 


Let us illustrate an application not involving time intervals in which the Poisson distribu- 
tion is useful. Suppose we are concerned with the occurrence of major defects in a highway 
one month after resurfacing. We will assume that the probability of a defect is the same for 
any two highway intervals of equal length and that the occurrence or nonoccurrence of a 
defect in any one interval is independent of the occurrence or nonoccurrence of a defect in 
any other interval. Hence, the Poisson distribution can be applied. 

Suppose we learn that major defects one month after resurfacing occur at the average 
rate of two per mile. Let us find the probability of no major defects in a particular 3-mile 
section of the highway. Because we are interested in an interval with a length of 3 miles,
μ = (2 defects/mile)(3 miles) = 6 represents the expected number of major defects over
the 3-mile section of highway. Using equation (5.15), the probability of no major defects is
f(0) = 6⁰e⁻⁶/0! = .0025. Thus, it is unlikely that no major defects will occur in the 3-mile
section. In fact, this example indicates a 1 − .0025 = .9975 probability of at least one
major defect in the 3-mile highway section.
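
As a rough check on the highway example, the short Python sketch below computes the probability of no occurrences and its complement for a length interval. The function name `prob_no_occurrences` is a hypothetical helper, and the sketch assumes the Poisson assumptions stated above hold.

```python
from math import exp


def prob_no_occurrences(rate_per_unit: float, length: float) -> float:
    """f(0) for a Poisson distribution with mean mu = rate * length.

    With x = 0 the Poisson probability function reduces to e^(-mu).
    """
    mu = rate_per_unit * length
    return exp(-mu)


p_none = prob_no_occurrences(rate_per_unit=2, length=3)  # 2 defects/mile, 3 miles
p_at_least_one = 1 - p_none

print(f"P(no major defects)      = {p_none:.4f}")
print(f"P(at least one defect)   = {p_at_least_one:.4f}")
```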


Using Excel to Compute Poisson Probabilities 


The Excel function for computing Poisson probabilities and cumulative probabilities is 
called POISSON.DIST. It works in much the same way as the Excel function for comput- 
ing binomial probabilities. Here we show how to use it to compute Poisson probabilities 
and cumulative probabilities. To illustrate, we use the example introduced earlier in this 
section: cars arrive at a bank drive-up teller window at the mean rate of 10 per 15-minute 
time interval. Refer to Figure 5.8 as we describe the tasks involved. 


Enter/Access Data: In order to compute a Poisson probability, we must know the mean
number of occurrences (μ) per time period and the number of occurrences for which we
want to compute the probability (x). For the drive-up teller window example, the occurrences
of interest are the arrivals of cars. The mean arrival rate is 10, which has been
entered in cell D1. Earlier in this section, we computed the probability of 5 arrivals. But
suppose we now want to compute the probability of 0 up through 20 arrivals. To do so, we 
enter the values 0, 1, 2,..., 20 in cells A4:A24. 


Enter Functions and Formulas: The POISSON.DIST function has three inputs: The first
is the value of x, the second is the value of μ, and the third is FALSE or TRUE. We choose
FALSE for the third input if a probability is desired and TRUE if a cumulative probability
is desired. The formula =POISSON.DIST(A4,$D$1,FALSE) has been entered in cell B4 to
compute the probability of 0 arrivals in a 15-minute period. The value worksheet in the
foreground shows that the probability of 0 arrivals is 0.0000. The formula in cell B4 is copied
to cells B5:B24 to compute the probabilities for 1 through 20 arrivals. Note, in cell B9 of
the value worksheet, that the probability of 5 arrivals is .0378. This result is the same as we
calculated earlier in the text.
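
For readers who want to cross-check the worksheet outside Excel, the following Python sketch mimics the three inputs of POISSON.DIST. The function `poisson_dist` is a hypothetical stand-in written for this illustration and is not an Excel or library function.

```python
from math import exp, factorial


def poisson_dist(x: int, mean: float, cumulative: bool) -> float:
    """Mimic the three inputs described above: x, the mean, and FALSE/TRUE."""

    def pmf(k: int) -> float:
        return mean**k * exp(-mean) / factorial(k)

    if cumulative:
        # Cumulative probability: sum f(0) + f(1) + ... + f(x)
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)


mean_arrivals = 10  # plays the role of cell D1

# Plays the role of cells A4:A24 and B4:B24
for arrivals in range(21):
    print(f"{arrivals:>2}  {poisson_dist(arrivals, mean_arrivals, False):.4f}")
```

Passing `True` as the third argument gives the cumulative probability instead, in the same way that TRUE does in the Excel function.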
FIGURE 5.8  Excel Worksheet for Computing Poisson Probabilities

[Figure: formula worksheet (background) and value worksheet (foreground). Cell D1 holds the
mean number of occurrences. Cells A4:A24, headed "No. of Arrivals (x)", hold the values
0, 1, 2, ..., 20. Cells B4:B24, headed "Probability f(x)", hold the formulas
=POISSON.DIST(A4,$D$1,FALSE) through =POISSON.DIST(A24,$D$1,FALSE). The value worksheet
includes a chart titled "Poisson Probabilities" showing the probability for each number
of arrivals from 0 through 20.]


Notice how easy it was to compute all the probabilities for 0 through 20 arrivals using 
the POISSON.DIST function. These calculations would take quite a bit of work using a 
calculator. We have also used Excel’s chart tools to develop a graph of the Poisson probab- 
ility distribution of arrivals. See the value worksheet in Figure 5.8. This chart gives a nice 
graphical presentation of the probabilities for the various number of arrival possibilities in 
a 15-minute interval. We can quickly see that the most likely number of arrivals is 9 or 10 
and that the probabilities fall off rather smoothly for smaller and larger values. 
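
The tie between 9 and 10 arrivals follows from the Poisson probability function: the ratio f(x)/f(x − 1) equals μ/x, so the probabilities rise while x is less than μ and are equal at x = μ − 1 and x = μ when μ is a whole number. The Python sketch below, with hypothetical names of our own choosing, builds the probabilities from this ratio and reports the most likely values.

```python
from math import exp, isclose


def poisson_probabilities(mu: float, max_x: int) -> list[float]:
    """Return [f(0), ..., f(max_x)] using f(x) = f(x - 1) * mu / x."""
    probs = [exp(-mu)]  # f(0) = e^(-mu)
    for x in range(1, max_x + 1):
        probs.append(probs[-1] * mu / x)
    return probs


probs = poisson_probabilities(10.0, 20)
peak = max(probs)

# Collect every x whose probability matches the peak up to rounding error
modes = [x for x, p in enumerate(probs) if isclose(p, peak, rel_tol=1e-9)]

print("Most likely number of arrivals:", modes)
print(f"f(9) = {probs[9]:.4f}, f(10) = {probs[10]:.4f}")
```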

Let us now see how cumulative probabilities are generated using Excel’s POISSON.DIST 
function. It is really a simple extension of what we have already done. We again use the example 
of arrivals at a drive-up teller window. Refer to Figure 5.9 as we describe the tasks involved. 


Enter/Access Data: To compute cumulative Poisson probabilities we must provide the
mean number of occurrences (μ) per time period and the values of x that we are interested
in. The mean arrival rate (10) has been entered in cell D1. Suppose we want to compute the
cumulative probabilities for a number of arrivals ranging from zero up through 20. To do 
so, we enter the values 0, 1, 2,..., 20 in cells A4:A24. 


Enter Functions and Formulas: Refer to the formula worksheet in the background
of Figure 5.8. The formulas we enter in cells B4:B24 of Figure 5.9 are the same as in
Figure 5.8 with one exception. Instead of FALSE for the third input, we enter the word 
TRUE to obtain cumulative probabilities. After entering these formulas in cells B4:B24 of 
the worksheet in Figure 5.9, the cumulative probabilities shown were obtained. 


Note, in Figure 5.9, that the probability of 5 or fewer arrivals is .0671 and that the
probability of 4 or fewer arrivals is .0293. Thus, the probability of exactly 5 arrivals is the
difference in these two numbers: f(5) = .0671 − .0293 = .0378. We computed this probability
earlier in this section and in Figure 5.8. Using these cumulative probabilities, it is easy to 
compute the probability that a random variable lies within a certain interval. For instance, 
suppose we wanted to know the probability of more than 5 and fewer than 16 arrivals. 
FIGURE 5.9  Excel Worksheet for Computing Cumulative Poisson Probabilities

[Figure: formula worksheet (background) and value worksheet (foreground). Cell D1 holds the
mean number of occurrences. Cells A4:A24 hold the number of arrivals x = 0, 1, 2, ..., 20.
Cells B4:B24 hold the formulas =POISSON.DIST(A4,$D$1,TRUE) through
=POISSON.DIST(A24,$D$1,TRUE). The value worksheet shows the following results.]

No. of Arrivals (x)    Cumulative Probability
 0                     0.0000
 1                     0.0005
 2                     0.0028
 3                     0.0103
 4                     0.0293
 5                     0.0671
 6                     0.1301
 7                     0.2202
 8                     0.3328
 9                     0.4579
10                     0.5830
11                     0.6968
12                     0.7916
13                     0.8645
14                     0.9165
15                     0.9513
16                     0.9730
17                     0.9857
18                     0.9928
19                     0.9965
20                     0.9984


We would just find the cumulative probability of 15 arrivals and subtract from that the
cumulative probability for 5 arrivals. Referring to Figure 5.9 to obtain the appropriate
probabilities, we obtain .9513 − .0671 = .8842. With such a high probability, we could
conclude that 6 to 15 cars will arrive in most 15-minute intervals. Using the cumulative
probability for 20 arrivals, we can also conclude that the probability of more than 20
arrivals in a 15-minute period is 1 − .9984 = .0016; thus, there is almost no chance of more
than 20 cars arriving.
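
The interval calculations above all follow one pattern: subtract one cumulative probability from another. A minimal Python sketch of that pattern follows; `poisson_cdf` is a hypothetical helper, and the results are computed at full precision rather than from the rounded worksheet values.

```python
from math import exp, factorial


def poisson_cdf(x: int, mu: float) -> float:
    """Cumulative Poisson probability P(X <= x)."""
    return sum(mu**k * exp(-mu) / factorial(k) for k in range(x + 1))


mu = 10  # mean arrivals per 15-minute period

p_exactly_5 = poisson_cdf(5, mu) - poisson_cdf(4, mu)    # P(x = 5)
p_6_to_15 = poisson_cdf(15, mu) - poisson_cdf(5, mu)     # P(5 < x < 16)
p_more_than_20 = 1 - poisson_cdf(20, mu)                 # P(x > 20)

print(f"P(x = 5)       = {p_exactly_5:.4f}")
print(f"P(6 <= x <= 15) = {p_6_to_15:.4f}")
print(f"P(x > 20)      = {p_more_than_20:.4f}")
```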


EXERCISES

Methods

44. Consider a Poisson distribution with μ = 3.
a. Write the appropriate Poisson probability function.
b. Compute f(2).
c. Compute f(1).
d. Compute P(x = 2).
45. Consider a Poisson distribution with a mean of two occurrences per time period.
a. Write the appropriate Poisson probability function.
b. What is the expected number of occurrences in three time periods?
c. Write the appropriate Poisson probability function to determine the probability of x
occurrences in three time periods.
d. Compute the probability of two occurrences in one time period.
e. Compute the probability of six occurrences in three time periods.
f. Compute the probability of five occurrences in two time periods.


Applications

46. Regional Airways Calls. Phone calls arrive at the rate of 48 per hour at the reservation
desk for Regional Airways.
a. Compute the probability of receiving three calls in a 5-minute interval of time.
b. Compute the probability of receiving exactly 10 calls in 15 minutes.
c. Suppose no calls are currently on hold. If the agent takes 5 minutes to complete the
current call, how many callers do you expect to be waiting by that time? What is the
probability that none will be waiting?
d. If no calls are currently being processed, what is the probability that the agent can
take 3 minutes for personal time without being interrupted by a call?

47. 911 Calls. Emergency 911 calls to a small municipality in Idaho come in at the rate of
one every 2 minutes.
a. What is the expected number of 911 calls in one hour?
b. What is the probability of three 911 calls in five minutes?
c. What is the probability of no 911 calls in a five-minute period?

48. Motor Vehicle Accidents in New York City. In a one-year period, New York City
had a total of 11,232 motor vehicle accidents that occurred on Monday through
Friday between the hours of 3 P.M. and 6 P.M. (New York State Department of Motor
Vehicles website). This corresponds to a mean of 14.4 accidents per hour.
a. Compute the probability of no accidents in a 15-minute period.
b. Compute the probability of at least one accident in a 15-minute period.
c. Compute the probability of four or more accidents in a 15-minute period.

49. Airport Passenger-Screening Facility. Airline passengers arrive randomly and
independently at the passenger-screening facility at a major international airport. The mean
arrival rate is 10 passengers per minute.
a. Compute the probability of no arrivals in a one-minute period.
b. Compute the probability that three or fewer passengers arrive in a one-minute
period.
c. Compute the probability of no arrivals in a 15-second period.
d. Compute the probability of at least one arrival in a 15-second period.

50. Tornadoes in Colorado. According to the National Oceanic and Atmospheric
Administration (NOAA), the state of Colorado averages 18 tornadoes every June (NOAA
website). (Note: There are 30 days in June.)
a. Compute the mean number of tornadoes per day.
b. Compute the probability of no tornadoes during a day.
c. Compute the probability of exactly one tornado during a day.
d. Compute the probability of more than one tornado during a day.

51. E-mails Received. According to a 2017 survey conducted by the technology market
research firm The Radicati Group, U.S. office workers receive an average of 121
e-mails per day (Entrepreneur magazine website). Assume the number of e-mails
received per hour follows a Poisson distribution and that the average number of e-mails
received per hour is five.
a. What is the probability of receiving no e-mails during an hour?
b. What is the probability of receiving at least three e-mails during an hour?
c. What is the expected number of e-mails received during 15 minutes?
d. What is the probability that no e-mails are received during 15 minutes?


5.7 Hypergeometric Probability Distribution 


The hypergeometric probability distribution is closely related to the binomial distribution. 
The two probability distributions differ in two key ways. With the hypergeometric distribu- 
tion, the trials are not independent, and the probability of success changes from trial to trial. 


In the usual notation for the hypergeometric distribution, r denotes the number of
elements in the population of size N labeled success, and N − r denotes the number of
elements in the population labeled failure. The hypergeometric probability function is
used to compute the probability that in a random selection of n elements, selected without
replacement, we obtain x elements labeled success and n − x elements labeled failure. For
this outcome to occur, we must obtain x successes from the r successes in the population
and n − x failures from the N − r failures. The following hypergeometric probability
function provides f(x), the probability of obtaining x successes in n trials.


HYPERGEOMETRIC PROBABILITY FUNCTION

\[ f(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} \] (5.16)


where 


x = the number of successes
n = the number of trials
f(x) = the probability of x successes in n trials
N = the number of elements in the population
r = the number of elements in the population labeled success


Note that $\binom{N}{n}$ represents the number of ways n elements can be selected from a
population of size N; $\binom{r}{x}$ represents the number of ways that x successes can be selected
from a total of r successes in the population; and $\binom{N-r}{n-x}$ represents the number of ways
that n − x failures can be selected from a total of N − r failures in the population.

For the hypergeometric probability distribution, x is a discrete random variable and
the probability function f(x) given by equation (5.16) is usually applicable for values of
x = 0, 1, 2, ..., n. However, only values of x where the number of observed successes
is less than or equal to the number of successes in the population (x ≤ r) and where the
number of observed failures is less than or equal to the number of failures in the
population (n − x ≤ N − r) are valid. If these two conditions do not hold for one or more
values of x, the corresponding f(x) = 0 indicating that the probability of this value of x
is zero.
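
The two validity conditions can be made explicit in code. The Python sketch below is an illustration under our own assumptions: `hypergeom_pmf` is a hypothetical helper, and the population values (N = 8, r = 6, n = 5) are made up for the demonstration and do not come from the text.

```python
from math import comb


def hypergeom_pmf(x: int, n: int, r: int, N: int) -> float:
    """Hypergeometric probability of x successes in n trials, equation (5.16)."""
    if x < 0 or x > n:
        return 0.0
    # Condition 1: observed successes cannot exceed successes in the population.
    # Condition 2: observed failures cannot exceed failures in the population.
    if x > r or (n - x) > (N - r):
        return 0.0
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


# Hypothetical population: N = 8 elements, r = 6 successes, n = 5 selected.
# Only 2 failures exist, so any x below 3 would require too many failures.
for x in range(6):
    print(f"f({x}) = {hypergeom_pmf(x, n=5, r=6, N=8):.4f}")
```

The probabilities for the valid values of x still sum to 1, which is a useful check when the conditions rule out some values.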

To illustrate the computations involved in using equation (5.16), let us consider the 
following quality control application. Electric fuses produced by Ontario Electric are 
packaged in boxes of 12 units each. Suppose an inspector randomly selects 3 of the 12 
fuses in a box for testing. If the box contains exactly 5 defective fuses, what is the
probability that the inspector will find exactly 1 of the 3 fuses defective? In this application,
n = 3 and N = 12. With r = 5 defective fuses in the box the probability of finding x = 1
defective fuse is

\[ f(1) = \frac{\binom{5}{1}\binom{7}{2}}{\binom{12}{3}} = \frac{\left(\frac{5!}{1!\,4!}\right)\left(\frac{7!}{2!\,5!}\right)}{\left(\frac{12!}{3!\,9!}\right)} = \frac{(5)(21)}{220} = .4773 \]

Now suppose that we wanted to know the probability of finding at least 1 defective
fuse. The easiest way to answer this question is to first compute the probability that the
inspector does not find any defective fuses. The probability of x = 0 is

\[ f(0) = \frac{\binom{5}{0}\binom{7}{3}}{\binom{12}{3}} = \frac{\left(\frac{5!}{0!\,5!}\right)\left(\frac{7!}{3!\,4!}\right)}{\left(\frac{12!}{3!\,9!}\right)} = \frac{(1)(35)}{220} = .1591 \]

With a probability of zero defective fuses f(0) = .1591, we conclude that the probability
of finding at least 1 defective fuse must be 1 − .1591 = .8409. Thus, there is a reasonably
high probability that the inspector will find at least 1 defective fuse.
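
The fuse calculations can also be carried out with exact fractions, which makes the counts in the numerator and denominator easy to inspect. The following Python sketch is illustrative only; `hypergeom_exact` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb


def hypergeom_exact(x: int, n: int, r: int, N: int) -> Fraction:
    """Equation (5.16) as an exact fraction (reduced to lowest terms)."""
    return Fraction(comb(r, x) * comb(N - r, n - x), comb(N, n))


# Fuse example: N = 12 fuses, r = 5 defective, n = 3 selected
f1 = hypergeom_exact(1, 3, 5, 12)   # (5)(21)/220
f0 = hypergeom_exact(0, 3, 5, 12)   # (1)(35)/220
at_least_one = 1 - f0

print(f"f(1) = {f1} = {float(f1):.4f}")
print(f"f(0) = {f0} = {float(f0):.4f}")
print(f"P(at least 1 defective) = {at_least_one} = {float(at_least_one):.4f}")
```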
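The mean and variance of a hypergeometric distribution are as follows.

\[ E(x) = \mu = n\left(\frac{r}{N}\right) \] (5.17)

\[ Var(x) = \sigma^2 = n\left(\frac{r}{N}\right)\left(1 - \frac{r}{N}\right)\left(\frac{N-n}{N-1}\right) \] (5.18)
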


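In the preceding example n = 3, r = 5, and N = 12. Thus, the mean and variance for the
number of defective fuses are

\[ \mu = n\left(\frac{r}{N}\right) = 3\left(\frac{5}{12}\right) = 1.25 \]

\[ \sigma^2 = n\left(\frac{r}{N}\right)\left(1 - \frac{r}{N}\right)\left(\frac{N-n}{N-1}\right) = 3\left(\frac{5}{12}\right)\left(1 - \frac{5}{12}\right)\left(\frac{12-3}{12-1}\right) = .60 \]

The standard deviation is σ = √.60 = .77.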
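
One way to gain confidence in equations (5.17) and (5.18) is to compare them with the general definitions of expected value and variance in equations (5.4) and (5.5). The Python sketch below does this for the fuse example; the function and variable names are hypothetical choices for the illustration.

```python
from math import comb, sqrt


def hypergeom_mean_var(n: int, r: int, N: int) -> tuple[float, float]:
    """Mean and variance from equations (5.17) and (5.18)."""
    p = r / N
    mean = n * p
    variance = n * p * (1 - p) * (N - n) / (N - 1)
    return mean, variance


n, r, N = 3, 5, 12
mean, variance = hypergeom_mean_var(n, r, N)

# Cross-check with E(x) = sum of x f(x) and Var(x) = sum of (x - mu)^2 f(x)
pmf = {x: comb(r, x) * comb(N - r, n - x) / comb(N, n) for x in range(n + 1)}
mean_check = sum(x * p for x, p in pmf.items())
var_check = sum((x - mean_check) ** 2 * p for x, p in pmf.items())

print(f"mean = {mean:.2f} (check {mean_check:.2f})")
print(f"variance = {variance:.2f} (check {var_check:.2f})")
print(f"standard deviation = {sqrt(variance):.2f}")
```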


Using Excel to Compute Hypergeometric Probabilities 


The Excel function for computing hypergeometric probabilities is HYPGEOM.DIST. It has
five inputs: the first is the value of x, the second is the value of n, the third is the value of r, the 
fourth is the value of N, and the fifth is FALSE or TRUE. We choose FALSE if a probability is 
desired and TRUE if a cumulative probability is desired. This function’s usage is similar to that 
of BINOM.DIST for the binomial distribution and POISSON.DIST for the Poisson distribution, 
so we dispense with showing a worksheet figure and just explain how to use the function. 

Let us reconsider the example of selecting 3 fuses for inspection from a fuse box 
containing 12 fuses, 5 of which are defective. We want to find the probability that 1 of the 
3 fuses selected is defective. In this case, the five inputs are x = 1, n = 3, r = 5, N = 12,
and FALSE. So, the appropriate formula to place in a cell of an Excel worksheet is
=HYPGEOM.DIST(1,3,5,12,FALSE). Placing this formula in a cell of an Excel worksheet
provides a hypergeometric probability of .4773. 

If we want to know the probability that none of the 3 fuses selected is defective, 
the five function inputs are x = 0, n = 3, r = 5, N = 12, and FALSE. So, using the 
HYPGEOM.DIST function to compute the probability of randomly selecting 3 fuses 
without any being defective, we would enter the following formula into an Excel
worksheet: =HYPGEOM.DIST(0,3,5,12,FALSE). The probability is .1591.

Cumulative probabilities can be obtained in a similar fashion by using TRUE for 
the fifth input. For instance, to compute the probability of finding at most 1 defective 
fuse, the appropriate formula is =HYPGEOM.DIST(1,3,5,12,TRUE). Placing this formula
in a cell of an Excel worksheet provides a hypergeometric cumulative probability
of .6364.
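
The same five inputs can be mirrored in Python as a cross-check on the Excel results. The function `hypgeom_dist` below is a hypothetical stand-in that follows the input order described above (x, n, r, N, and FALSE or TRUE); it is a sketch and not the Excel function itself.

```python
from math import comb


def hypgeom_dist(x: int, n: int, r: int, N: int, cumulative: bool) -> float:
    """Mimic the five inputs described above for the hypergeometric distribution."""

    def pmf(k: int) -> float:
        return comb(r, k) * comb(N - r, n - k) / comb(N, n)

    if cumulative:
        # Cumulative probability: f(0) + f(1) + ... + f(x)
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)


print(f"{hypgeom_dist(1, 3, 5, 12, False):.4f}")  # exactly 1 defective
print(f"{hypgeom_dist(0, 3, 5, 12, False):.4f}")  # no defectives
print(f"{hypgeom_dist(1, 3, 5, 12, True):.4f}")   # at most 1 defective
```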


NOTES + COMMENTS


Consider a hypergeometric distribution with n trials. Let p = (r/N) denote the
probability of a success on the first trial. If the population size is large, the term
(N − n)/(N − 1) in equation (5.18) approaches 1. As a result, the expected value and
variance can be written E(x) = np and Var(x) = np(1 − p). Note that these expressions
are the same as the expressions used to compute the expected value and variance of a
binomial distribution, as in equations (5.13) and (5.14). When the population size is
large, a hypergeometric distribution can be approximated by a binomial distribution
with n trials and a probability of success p = (r/N).
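
The quality of the binomial approximation can be explored numerically. In the Python sketch below, the population sizes (N = 20 and N = 10,000, each with 40% successes and n = 10 trials) are hypothetical values chosen only to contrast a small population with a large one; the helper names are also our own.

```python
from math import comb


def hypergeom_pmf(x: int, n: int, r: int, N: int) -> float:
    """Equation (5.16); comb returns 0 when a selection is impossible."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


def binom_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability function, equation (5.12)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def max_gap(n: int, r: int, N: int) -> float:
    """Largest difference between the two distributions over x = 0, ..., n."""
    p = r / N
    return max(
        abs(hypergeom_pmf(x, n, r, N) - binom_pmf(x, n, p)) for x in range(n + 1)
    )


small_gap = max_gap(n=10, r=8, N=20)           # small population
large_gap = max_gap(n=10, r=4_000, N=10_000)   # large population
correction = (10_000 - 10) / (10_000 - 1)      # (N - n)/(N - 1) for the large case

print(f"largest gap, N = 20:     {small_gap:.4f}")
print(f"largest gap, N = 10,000: {large_gap:.4f}")
print(f"(N - n)/(N - 1) for N = 10,000: {correction:.4f}")
```

With the large population the correction term is close to 1 and the two sets of probabilities nearly coincide, while the small population shows a visibly larger gap.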


EXERCISES 


Methods 

52. Suppose N = 10 and r = 3. Compute the hypergeometric probabilities for the following
values of n and x.
a. n = 4, x = 1.
b. n = 2, x = 2.
c. n = 2, x = 0.
d. n = 4, x = 2.
e. n = 4, x = 4.

53. Suppose N = 15 and r = 4. What is the probability of x = 3 for n = 10? 


Applications 

54. Online Holiday Shopping. More and more shoppers prefer to do their holiday 
shopping online from companies such as Amazon. Suppose we have a group of 10 
shoppers; 7 prefer to do their holiday shopping online and 3 prefer to do their holiday 
shopping in stores. A random sample of 3 of these 10 shoppers is selected for a more 
in-depth study of how the economy has impacted their shopping behavior. 

a. What is the probability that exactly 2 prefer shopping online? 
b. What is the probability that the majority (either 2 or 3) prefer shopping online? 

55. Playing Blackjack. Blackjack, or twenty-one as it is frequently called, is a popu- 
lar gambling game played in casinos. A player is dealt two cards. Face cards (jacks, 
queens, and kings) and tens have a point value of 10. Aces have a point value of 1 or 
11. A 52-card deck contains 16 cards with a point value of 10 (jacks, queens, kings, 
and tens) and four aces. 

a. What is the probability that both cards dealt are aces or 10-point cards? 

b. What is the probability that both of the cards are aces? 

c. What is the probability that both of the cards have a point value of 10? 

d. A blackjack is a 10-point card and an ace for a value of 21. Use your answers to 
parts (a), (b), and (c) to determine the probability that a player is dealt blackjack. 
(Hint: Part (d) is not a hypergeometric problem. Develop your own logical relation- 
ship as to how the hypergeometric probabilities from parts (a), (b), and (c) can be 
combined to answer this question.) 

56. Computer Company Benefits Questionnaire. Axline Computers manufactures 
personal computers at two plants, one in Texas and the other in Hawaii. The Texas 
plant has 40 employees; the Hawaii plant has 20. A random sample of 10 employees is 
asked to fill out a benefits questionnaire. 

a. What is the probability that none of the employees in the sample work at the plant 
in Hawaii? 

b. What is the probability that 1 of the employees in the sample works at the plant in 
Hawaii? 

c. What is the probability that 2 or more of the employees in the sample work at the 
plant in Hawaii? 

d. What is the probability that 9 of the employees in the sample work at the plant in Texas? 
57. Business Meal Reimbursement. The Zagat Restaurant Survey provides food, decor,
and service ratings for some of the top restaurants across the United States. For 15 
restaurants located in Boston, the average price of a dinner, including one drink and 
tip, was $48.60. You are leaving on a business trip to Boston and will eat dinner at 
three of these restaurants. Your company will reimburse you for a maximum of $50 per 
dinner. Business associates familiar with these restaurants have told you that the meal 
cost at one-third of these restaurants will exceed $50. Suppose that you randomly 
select three of these restaurants for dinner. 

a. What is the probability that none of the meals will exceed the cost covered by your 
company? 

b. What is the probability that one of the meals will exceed the cost covered by your 
company? 

c. What is the probability that two of the meals will exceed the cost covered by your 
company? 

d. What is the probability that all three of the meals will exceed the cost covered by 
your company? 

58. TARP Funds. The Troubled Asset Relief Program (TARP), passed by the U.S. 
Congress in October 2008, provided $700 billion in assistance for the struggling U.S. 
economy. Over $200 billion was given to troubled financial institutions with the hope 
that there would be an increase in lending to help jump-start the economy. But three 
months later, a Federal Reserve survey found that two-thirds of the banks that had 
received TARP funds had tightened terms for business loans (The Wall Street Journal). 
Of the 10 banks that were the biggest recipients of TARP funds, only 3 had actually 
increased lending during this period. 


Increased Lending      Decreased Lending
BB&T                   Bank of America
Sun Trust Banks        Capital One
U.S. Bancorp           Citigroup
                       Fifth Third Bancorp
                       J.P. Morgan Chase
                       Regions Financial
                       Wells Fargo


For the purposes of this exercise, assume that you will randomly select 3 of these 10
banks for a study that will continue to monitor bank lending practices. Let x be a random
variable indicating the number of banks in the study that had increased lending.

a. What is f(0)? What is your interpretation of this value? 

b. What is f(3)? What is your interpretation of this value? 

c. Compute f(1) and f(2). Show the probability distribution for the number of banks in 
the study that had increased lending. What value of x has the highest probability? 

d. What is the probability that the study will have at least one bank that had in- 
creased lending? 

e. Compute the expected value, variance, and standard deviation for the random variable. 


SUMMARY 


A random variable provides a numerical description of the outcome of an experiment. The 
probability distribution for a random variable describes how the probabilities are distrib- 
uted over the values the random variable can assume. For any discrete random variable 
x, the probability distribution is defined by a probability function, denoted by f(x), which 
provides the probability associated with each value of the random variable. 

We introduced two types of discrete probability distributions. One type involved provid- 
ing a list of the values of the random variable and the associated probabilities in a table. We 
showed how the relative frequency method of assigning probabilities could be used to
develop empirical discrete probability distributions of this type. Bivariate empirical
distributions were also discussed. With bivariate distributions, interest focuses on the relationship
between two random variables. We showed how to compute the covariance and correlation 
coefficient as measures of such a relationship. We also showed how bivariate distributions 
involving market returns on financial assets could be used to create financial portfolios. 

The second type of discrete probability distribution we discussed involved the use of a 
mathematical function to provide the probabilities for the random variable. The binomial, 
Poisson, and hypergeometric distributions discussed were all of this type. The binomial 
distribution can be used to determine the probability of x successes in n trials whenever the 
experiment has the following properties: 


1. The experiment consists of a sequence of n identical trials. 

2. Two outcomes are possible on each trial, one called success and the other failure. 

3. The probability of a success p does not change from trial to trial. Consequently, the
probability of failure, 1 − p, does not change from trial to trial.

4. The trials are independent. 


When the four properties hold, the binomial probability function can be used to determine 
the probability of obtaining x successes in n trials. Formulas were also presented for the 
mean and variance of the binomial distribution. 

The Poisson distribution is used when it is desirable to determine the probability of 
obtaining x occurrences over an interval of time or space. The following assumptions are 
necessary for the Poisson distribution to be applicable: 


1. The probability of an occurrence of the event is the same for any two intervals of 
equal length. 

2. The occurrence or nonoccurrence of the event in any interval is independent of the 
occurrence or nonoccurrence of the event in any other interval. 


A third discrete probability distribution, the hypergeometric, was introduced in Sec- 
tion 5.7. Like the binomial, it is used to compute the probability of x successes in n trials. 
But, in contrast to the binomial, the probability of success changes from trial to trial. 


GLOSSARY 
Binomial experiment An experiment having the four properties stated at the beginning of 
Section 5.5. 

Binomial probability distribution A probability distribution showing the probability of x 
successes in n trials of a binomial experiment.

Binomial probability function The function used to compute binomial probabilities. 
Bivariate probability distribution A probability distribution involving two random vari- 
ables. A discrete bivariate probability distribution provides a probability for each pair of 
values that may occur for the two random variables. 

Continuous random variable A random variable that may assume any numerical value in 
an interval or collection of intervals. 

Discrete random variable A random variable that may assume either a finite number of 
values or an infinite sequence of values. 

Discrete uniform probability distribution A probability distribution for which each 
possible value of the random variable has the same probability. 

Empirical discrete distribution A discrete probability distribution for which the relative 
frequency method is used to assign the probabilities. 

Expected value A measure of the central location, or mean, of a random variable. 
Hypergeometric probability distribution A probability distribution showing the 
probability of x successes in n trials from a population with r successes and N − r
failures.
Hypergeometric probability function The function used to compute hypergeometric
probabilities. 

Poisson probability distribution A probability distribution showing the probability of x 
occurrences of an event over a specified interval of time or space. 

Poisson probability function The function used to compute Poisson probabilities. 
Probability distribution A description of how the probabilities are distributed over the 
values of the random variable. 

Probability function A function, denoted by f(x), that provides the probability that x 
assumes a particular value for a discrete random variable. 

Random variable A numerical description of the outcome of an experiment. 
Standard deviation The positive square root of the variance. 

Variance A measure of the variability, or dispersion, of a random variable. 


KEY FORMULAS

Discrete Uniform Probability Function

\[ f(x) = 1/n \] (5.3)

Expected Value of a Discrete Random Variable

\[ E(x) = \mu = \sum x f(x) \] (5.4)

Variance of a Discrete Random Variable

\[ Var(x) = \sigma^2 = \sum (x - \mu)^2 f(x) \] (5.5)

Covariance of Random Variables x and y

\[ \sigma_{xy} = [Var(x + y) - Var(x) - Var(y)]/2 \] (5.6)

Correlation between Random Variables x and y

\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \] (5.7)

Expected Value of a Linear Combination of Random Variables x and y

\[ E(ax + by) = aE(x) + bE(y) \] (5.8)

Variance of a Linear Combination of Two Random Variables

\[ Var(ax + by) = a^2 Var(x) + b^2 Var(y) + 2ab\sigma_{xy} \] (5.9)

where \sigma_{xy} is the covariance of x and y

Number of Experimental Outcomes Providing Exactly x Successes in n Trials

\[ \binom{n}{x} = \frac{n!}{x!(n-x)!} \] (5.10)

Binomial Probability Function

\[ f(x) = \binom{n}{x} p^x (1-p)^{(n-x)} \] (5.12)

Expected Value for the Binomial Distribution

\[ E(x) = \mu = np \] (5.13)

Variance for the Binomial Distribution

\[ Var(x) = \sigma^2 = np(1 - p) \] (5.14)

Poisson Probability Function

\[ f(x) = \frac{\mu^x e^{-\mu}}{x!} \] (5.15)

Hypergeometric Probability Function

\[ f(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} \] (5.16)

Expected Value for the Hypergeometric Distribution

\[ E(x) = \mu = n\left(\frac{r}{N}\right) \] (5.17)

Variance for the Hypergeometric Distribution

\[ Var(x) = \sigma^2 = n\left(\frac{r}{N}\right)\left(1 - \frac{r}{N}\right)\left(\frac{N-n}{N-1}\right) \] (5.18)

SUPPLEMENTARY EXERCISES 


59. Wind Conditions and Boating Accidents. The U.S. Coast Guard (USCG) provides
a wide variety of information on boating accidents including the wind condition at the
time of the accident. The following table shows the results obtained for 4,401 accidents
(USCG website).

Wind Condition    Percentage of Accidents
None               9.6
Light             57.0
Moderate          23.8
Strong             7.7
Storm              1.9

Let x be a random variable reflecting the known wind condition at the time of each
accident. Set x = 0 for none, x = 1 for light, x = 2 for moderate, x = 3 for strong, and
x = 4 for storm.
a. Develop a probability distribution for x.
b. Compute the expected value of x.
c. Compute the variance and standard deviation for x.
d. Comment on what your results imply about the wind conditions during boating
accidents.

60. Wait Times at Car Repair Garages. The Car Repair Ratings website provides 
consumer reviews and ratings for garages in the United States and Canada. The 
time customers wait for service to be completed is one of the categories rated. The 
following table provides a summary of the wait-time ratings (1 = Slow/Delays; 
10 = Quick/On Time) for 40 randomly selected garages located in the province of 
Ontario, Canada. 


Wait-Time Rating    Number of Garages
 1                  6
 2                  2
 3                  5
 4                  2
 5                  5
 6                  2
 7                  4
 8                  5
 9                  5
10                  6


a. Develop a probability distribution for x = wait-time rating. 

b. Any garage that receives a wait-time rating of at least 9 is considered to provide 
outstanding service. If a consumer randomly selects one of the 40 garages for their 
next car service, what is the probability the garage selected will provide outstanding 
wait-time service? 

c. What is the expected value and variance for x? 

d. Suppose that seven of the 40 garages reviewed were new car dealerships. Of the 
seven new car dealerships, two were rated as providing outstanding wait-time 
service. Compare the likelihood of a new car dealership achieving an outstanding 
wait-time service rating as compared to other types of service providers. 

61. Expense Forecasts. The budgeting process for a midwestern college resulted in 
expense forecasts for the coming year (in $ millions) of $9, $10, $11, $12, and $13. 
Because the actual expenses are unknown, the following respective probabilities are 
assigned: .3, .2, .25, .05, and .2. 

a. Show the probability distribution for the expense forecast. 

b. What is the expected value of the expense forecast for the coming year? 

c. What is the variance of the expense forecast for the coming year? 

d. If income projections for the year are estimated at $12 million, comment on the 
financial position of the college. 

62. Bookstore Customer Purchases. A bookstore at the Hartsfield-Jackson Airport in 
Atlanta sells reading materials (paper-back books, newspapers, magazines) as well as 
snacks (peanuts, pretzels, candy, etc.). A point-of-sale terminal collects a variety of infor- 
mation about customer purchases. The following table shows the number of snack items 
and the number of items of reading material purchased by the most recent 600 customers. 


Reading Material 
Snacks 0 1 2 


a. Using the data in the table construct an empirical discrete bivariate probability distribu- 
tion for x = number of snack items and y = number of reading materials for a randomly 
selected customer purchase. What is the probability of a customer purchase consisting 
of one item of reading materials and two snack items? What is the probability of a cus- 
tomer purchasing one snack item only? Why is the probability f(x = 0, y = 0) = 0? 


b. Show the marginal probability distribution for the number of snack items pur- 
chased. Compute the expected value and variance. 

c. What is the expected value and variance for the number of reading materials pur- 
chased by a customer? 

d. Show the probability distribution for t = total number of items for a randomly
selected customer purchase. Compute its expected value and variance. 

e. Compute the covariance and correlation coefficient between x and y. What is the 
relationship, if any, between the number of reading materials and number of snacks 
purchased? 

63. Creating a Diversified Investment Portfolio. The Knowles/Armitage (KA) group
at Merrill Lynch advises clients on how to create a diversified investment portfolio.
One of the investment alternatives they make available to clients is the All World Fund
composed of global stocks with good dividend yields. One of their clients is interested
in a portfolio consisting of investment in the All World Fund and a treasury bond fund.
The expected percent return of an investment in the All World Fund is 7.80% with a
standard deviation of 18.90%. The expected percent return of an investment in a treasury
bond fund is 5.50% and the standard deviation is 4.60%. The covariance of an
investment in the All World Fund with an investment in a treasury bond fund is −12.4.

a. Which of the funds would be considered the more risky? Why? 

b. If KA recommends that the client invest 75% in the All World Fund and 25% in 
the treasury bond fund, what is the expected percent return and standard deviation 
for such a portfolio? What would be the expected return and standard deviation, in 
dollars, for a client investing $10,000 in such a portfolio? 

c. If KA recommends that the client invest 25% in the All World Fund and 75% in the 
treasury bond fund, what is the expected return and standard deviation for such a 
portfolio? What would be the expected return and standard deviation, in dollars, for 
a client investing $10,000 in such a portfolio? 

d. Which of the portfolios in parts (b) and (c) would you recommend for an ag- 
gressive investor? Which would you recommend for a conservative investor? 
Why? 

64. Giving Up Technology. A Pew Research Center survey asked adults in the United
States which technologies would be “very hard” to give up. The following responses
were obtained: Internet 53%, smartphone 49%, e-mail 36%, and land-line
phone 28% (USA Today website).

a. If 20 adult Internet users are surveyed, what is the probability that 3 users will 
report that it would be very hard to give it up? 

b. If 20 adults who own a land-line phone are surveyed, what is the probability that 
5 or fewer will report that it would be very hard to give it up? 

c. If 2,000 owners of smartphones were surveyed, what is the expected number that 
will report that it would be very hard to give it up? 

d. If 2,000 users of e-mail were surveyed, what is the expected number that will report
that it would be very hard to give it up? What is the variance and standard
deviation?

65. Investing in the Stock Market. According to a 2017 Gallup survey, the percentage
of individuals in the United States who are invested in the stock market by age is as
shown in the following table (Gallup website).


Age Range    Percent of Individuals Invested in Stock Market
18 to 29     31
30 to 49     62
50 to 64     62
65+          54
Suppose Gallup wishes to complete a follow-up survey to find out more about the
specific type of stocks people in the United States are purchasing.

a. How many 18- to 29-year-olds must be sampled to find at least 50 who invest in the 
stock market? 

b. How many people 65 years of age and older must be sampled to find at least 50 
who invest in the stock market? 

c. If 1,000 individuals are randomly sampled, what is the expected number of 18- to 
29-year-olds who invest in the stock market in this sample? What is the standard 
deviation of the number of 18- to 29-year-olds who invest in the stock market? 

d. If 1,000 individuals are randomly sampled, what is the expected number of those 65 years 
of age and older who invest in the stock market in this sample? What is the standard devi- 
ation of the number of those 65 years of age and older who invest in the stock market? 

66. Acceptance Sampling. Many companies use a quality control technique called ac- 
ceptance sampling to monitor incoming shipments of parts, raw materials, and so on. 
In the electronics industry, component parts are commonly shipped from suppliers in 
large lots. Inspection of a sample of n components can be viewed as the n trials of a 
binomial experiment. The outcome for each component tested (trial) will be that the 
component is classified as good or defective. Reynolds Electronics accepts a lot from a 
particular supplier if the defective components in the lot do not exceed 1%. Suppose a 
random sample of five items from a recent shipment is tested. 

a. Assume that 1% of the shipment is defective. Compute the probability that no items 
in the sample are defective. 

b. Assume that 1% of the shipment is defective. Compute the probability that exactly 
one item in the sample is defective. 

c. What is the probability of observing one or more defective items in the sample if 
1% of the shipment is defective? 

d. Would you feel comfortable accepting the shipment if one item was found to be 
defective? Why or why not? 

67. Americans with at Least a Two-Year Degree. PBS News Hour reported in 2014 that 
39.4% of Americans between the ages of 25 and 64 have at least a two-year college 
degree (PBS website). Assume that 50 Americans between the ages of 25 and 64 
are selected randomly. 

a. What is the expected number of people with at least a two-year college degree? 

b. What are the variance and standard deviation for the number of people with at least 
a two-year college degree? 

68. Choosing a Home Builder. Mahoney Custom Home Builders, Inc. of Canyon Lake, 
Texas, asked visitors to their website what is most important when choosing a home 
builder. Possible responses were quality, price, customer referral, years in business, 
and special features. Results showed that 23.5% of the respondents chose price as 
the most important factor (Mahoney Custom Homes website). Suppose a sample of 
200 potential home buyers in the Canyon Lake area was selected. 

a. How many people would you expect to choose price as the most important factor 
when choosing a home builder? 

b. What is the standard deviation of the number of respondents who would choose 
price as the most important factor in selecting a home builder? 

c. What is the standard deviation of the number of respondents who do not list price 
as the most important factor in selecting a home builder? 

69. Arrivals to a Car Wash. Cars arrive at a car wash randomly and independently; the 
probability of an arrival is the same for any two time intervals of equal length. The 
mean arrival rate is 15 cars per hour. What is the probability that 20 or more cars 
will arrive during any given hour of operation? 

70. Production Process Breakdowns. A new automated production process averages 1.5 breakdowns
per day. Because of the cost associated with a breakdown, management is concerned
about the possibility of having 3 or more breakdowns during a day. Assume that breakdowns
occur randomly, that the probability of a breakdown is the same for any two time intervals
of equal length, and that breakdowns in one period are independent of breakdowns in other
periods. What is the probability of having 3 or more breakdowns during a day?

71. Small Business Failures. A regional director responsible for business development 
in the state of Pennsylvania is concerned about the number of small business failures. 
If the mean number of small business failures per month is 10, what is the probability 
that exactly 4 small businesses will fail during a given month? Assume that the prob- 
ability of a failure is the same for any two months and that the occurrence or nonoc- 
currence of a failure in any month is independent of failures in any other month. 

72. Bank Customer Arrivals. Customer arrivals at a bank are random and independent; 
the probability of an arrival in any one-minute period is the same as the probabil- 
ity of an arrival in any other one-minute period. Answer the following questions, 
assuming a mean arrival rate of 3 customers per minute. 

a. What is the probability of exactly 3 arrivals in a one-minute period? 
b. What is the probability of at least 3 arrivals in a one-minute period? 

73. Poker Hands. A deck of playing cards contains 52 cards, four of which are aces. What 
is the probability that the deal of a five-card poker hand provides 
a. A pair of aces? 

b. Exactly one ace? 
c. No aces? 
d. At least one ace? 

74. Business School Student GPAs. According to U.S. News & World Report, 7 of the
top 10 graduate schools of business have students with an average undergraduate grade 
point average (GPA) of 3.50 or higher. Suppose that we randomly select 2 of the top 
10 graduate schools of business. 

a. What is the probability that exactly one school has students with an average under- 
graduate GPA of 3.50 or higher? 

b. What is the probability that both schools have students with an average undergradu- 
ate GPA of 3.50 or higher? 

c. What is the probability that neither school has students with an average undergradu- 
ate GPA of 3.50 or higher? 


CASE PROBLEM 1: GO BANANAS! 
BREAKFAST CEREAL 
Great Grasslands Grains, Inc. (GGG) manufactures and sells a wide variety of breakfast 
cereals. GGG’s product development lab recently created a new cereal that consists of rice 
flakes and banana-flavored marshmallows. The company’s marketing research department 
has tested the new cereal extensively and has found that consumers are enthusiastic about 
the cereal when 16-ounce boxes contain at least 1.6 ounces and no more than 2.4 ounces of 
the banana-flavored marshmallows. 

As GGG prepares to begin producing and selling 16-ounce boxes of the new cereal,
which it has named Go Bananas!, management is concerned about the amount of
banana-flavored marshmallows. It wants to be careful not to include less than 1.6 ounces
or more than 2.4 ounces of banana-flavored marshmallows in each 16-ounce box of Go
Bananas!. Tina Finkel, VP of Production for GGG, has suggested that the company measure
the weight of banana-flavored marshmallows in a random sample of 25 boxes of Go
Bananas! on a weekly basis. Each week, GGG can count the number of boxes out of the
25 boxes in the sample that contain less than 1.6 ounces or more than 2.4 ounces of
banana-flavored marshmallows; if the number of boxes that fail to meet the standard weight
of banana-flavored marshmallows is too high, production will be shut down and inspected.

Ms. Finkel and her staff have designed the production process so that only 8% of all 
16-ounce boxes of Go Bananas! fail to meet the standard weight of banana-flavored marsh- 
mallows. After much debate, GGG management has decided to shut down production of 
Go Bananas! if at least five boxes in a weekly sample fail to meet the standard weight of 
banana-flavored marshmallows. 
Managerial Report
Prepare a managerial report that addresses the following issues. 


1. Calculate the probability that a weekly sample will result in a shutdown of production 
if the production process is working properly. Comment on GGG management’s policy 
for deciding when to shut down production of Go Bananas!. 

2. GGG management wants to shut down production of Go Bananas! no more than 1% 
of the time when the production process is working properly. Suggest the appropriate 
number of boxes in the weekly sample that must fail to meet the standard weight of 
banana-flavored marshmallows in order for production to be shut down if this goal is to 
be achieved. 

3. Ms. Finkel has suggested that if given sufficient resources, she could redesign the
production process to reduce the percentage of 16-ounce boxes of Go Bananas! that fail to
meet the standard weight of banana-flavored marshmallows when the process is working
properly. To what level must Ms. Finkel reduce the percentage of 16-ounce boxes of Go
Bananas! that fail to meet the standard weight of banana-flavored marshmallows when
the process is working properly in order for her to reduce the probability that at least five of
the sampled boxes fail to meet the standard to .01 or less?


CASE PROBLEM 2: McNEIL’S AUTO MALL 




Harriet McNeil, proprietor of McNeil’s Auto Mall, believes that it is good business for her 
automobile dealership to have more customers on the lot than can be served, as she be- 
lieves this creates an impression that demand for the automobiles on her lot is high. How- 
ever, she also understands that if there are far more customers on the lot than can be served 
by her salespeople, her dealership may lose sales to customers who become frustrated and 
leave without making a purchase. 

Ms. McNeil is primarily concerned about the staffing of salespeople on her lot on 
Saturday mornings (8:00 A.M. to noon), which is the busiest time of the week for
McNeil’s Auto Mall. On Saturday mornings, an average of 6.8 customers arrive per hour. 
The customers arrive randomly at a constant rate throughout the morning, and a salesper- 
son spends an average of one hour with a customer. Ms. McNeil’s experience has led her to 
conclude that if there are two more customers on her lot than can be served at any time on 
a Saturday morning, her automobile dealership achieves the optimal balance of creating an 
impression of high demand without losing too many customers who become frustrated and 
leave without making a purchase. 

Ms. McNeil now wants to determine how many salespeople she should have on her lot 
on Saturday mornings in order to achieve her goal of having two more customers on her lot 
than can be served at any time. She understands that occasionally the number of customers 
on her lot will exceed the number of salespersons by more than two, and she is willing to 
accept such an occurrence no more than 10% of the time. 


Managerial Report 

Ms. McNeil has asked you to determine the number of salespersons she should have on 
her lot on Saturday mornings in order to satisfy her criteria. In answering Ms. McNeil’s 
question, consider the following three questions: 


1. How is the number of customers who arrive on the lot on a Saturday morning distributed? 

2. Suppose Ms. McNeil currently uses five salespeople on her lot on Saturday mornings. 
Using the probability distribution you identified in (1), what is the probability that the 
number of customers who arrive on her lot will exceed the number of salespersons by 
more than two? Does her current Saturday morning employment strategy satisfy her 
stated objective? Why or why not? 

3. What is the minimum number of salespeople Ms. McNeil should have on her lot on 
Saturday mornings to achieve her objective? 
CASE PROBLEM 3: GRIEVANCE COMMITTEE


Several years ago, management at Tuglar Corporation established a grievance committee 
composed of employees who volunteered to work toward the amicable resolution of dis- 
putes between Tuglar management and its employees. Each year management issues a call 
for volunteers to serve on the grievance committee, and 10 of the respondents are randomly 
selected to serve on the committee for the upcoming year. 

Employees in the Accounting Department are distressed because no member of their 
department has served on the Tuglar grievance committee in the past five years. Manage- 
ment has assured its employees in the Accounting Department that the selections have been 
made randomly, but these assurances have not quelled suspicions that management has 
intentionally omitted accountants from the committee. The table below summarizes the 
total number of volunteers and the number of employees from the Accounting Department 
who have volunteered for the grievance committee in each of the past five years: 


                                                      Year 1   Year 2   Year 3   Year 4   Year 5
Total Number of Volunteers                              29       31       23       26       28
Number of Volunteers from the Accounting Department      1        1        1        2        1


In its defense, management has provided these numbers to the Accounting Department. 
Given these numbers, is the lack of members of the Accounting Department on the griev- 
ance committee for the past five years suspicious (i.e., unlikely)? 


Managerial Report 
In addressing the issue of whether or not the committee selection process is random, con- 
sider the following questions: 


1. How is the number of members of the Accounting Department who are selected to serve 
on the grievance committee distributed? 

2. Using the probability distribution you identified in (1), what is the probability for each 
of these five years that no members of the Accounting Department have been selected to 
serve? 

3. Using the probabilities you identified in (2), what is the probability that no members 
of the Accounting Department have been selected to serve during the past five years? 

4. What is the cause of the lack of Accounting Department representation on the grievance 
committee over the past five years? What can be done to increase the probability that a 
member of the Accounting Department will be selected to serve on the grievance com- 
mittee using the current selection method? 


CASE PROBLEM 4: SAGITTARIUS CASINO 
The Sagittarius Casino’s strategy for establishing a competitive advantage over its compet- 
itors is to periodically create unique and interesting new games for its customers to play. 
Sagittarius management feels it is time for the casino to once again introduce a new game to 
excite its customer base, and Sagittarius’s Director of Research and Development, Lou Zerbit, 
believes he and his staff have developed a new game that will accomplish this goal. The 
game, which they have named POSO! (an acronym for Payouts On Selected Outcomes),
is to be played in the following manner. A player will select two different values from 1, 2,
3, 4, 5, and 6. Two dice are then rolled. If the first number the player selected comes up on
at least one of the two dice, the player wins $5.00; if the second number the player selected
comes up on both of the dice, the player wins $10.00. If neither of these events occurs, the
player wins nothing.

For example, suppose a player fills out the following card for one game of POSO! 


POSO!

First Number (select one number from 1, 2, 3, 4, 5, or 6): 4
If this number comes up on at least one die, you win $5.00!

Second Number (select a different number from 1, 2, 3, 4, 5, or 6): 2
If this number comes up on both dice, you win $10.00!


When the two dice are rolled, if at least one die comes up 4 the player will win $5.00, if 
both dice come up 2 the player will win $10.00, and if any other outcome occurs the player 
wins nothing. 


Managerial Report 
Sagittarius management now has three questions about POSO! These questions should be 
addressed in your report. 


1. Although they certainly do not want to pay out more than they take in, casinos like to 
offer games in which players win frequently; casino managers believe this keeps players 
excited about playing the game. What is the probability a player will win if she or he 
plays a single game of POSO!? 

2. What is the expected amount a player will win when playing one game of POSO!? 

3. Sagittarius managers want to take in more than they pay out on average for a game
of POSO!. Furthermore, casinos such as Sagittarius are often looking for games that
provide their gamers with an opportunity to play for a small bet, and Sagittarius
management would like to charge players $2.00 to play one game of POSO!. What will be
the expected profit earned by Sagittarius Casino on a single play if a player has to pay
$2.00 for a single play of POSO!? Will Sagittarius Casino expect to earn or lose money
on POSO! if a player pays $2.00 for a single play? What is the minimum amount
Sagittarius Casino can charge a player for a single play of POSO! and still expect to earn
money?
Chapter 6 Continuous Probability Distributions


STATISTICS IN PRACTICE 


Procter & Gamble* 
CINCINNATI, OHIO 


Procter & Gamble (P&G) produces and markets such 
products as detergents, disposable diapers, razors, 
toothpastes, soaps, mouthwashes, and paper towels. 
Worldwide, it has the leading brand in more categories 
than any other consumer products company. 

As a leader in the application of statistical methods 
in decision making, P&G employs people with diverse 
academic backgrounds: engineering, statistics, opera- 
tions research, analytics and business. The major quan- 
titative technologies for which these people provide 
support are probabilistic decision and risk analysis, ad- 
vanced simulation, quality improvement, and quantita- 
tive methods (e.g., linear programming, data analytics, 
probability analysis, machine learning). 

The Industrial Chemicals Division of P&G is a major 
supplier of fatty alcohols derived from natural substances 
such as coconut oil and from petroleum-based deriva- 
tives. The division wanted to know the economic risks 
and opportunities of expanding its fatty-alcohol produc- 
tion facilities, so it called in P&G's experts in probabilistic 
decision and risk analysis to help. After structuring and 
modeling the problem, they determined that the key to 
profitability was the cost difference between the petroleum- 
and coconut-based raw materials. Future costs were un- 
known, but the analysts were able to approximate them 
with the following continuous random variables. 


x = the coconut oil price per pound of fatty alcohol 
and 


y = the petroleum raw material price per pound 
of fatty alcohol 


Because the key to profitability was the difference 
between these two random variables, a third random 
variable, d = x − y, was used in the analysis. Experts
were interviewed to determine the probability distri- 
butions for x and y. In turn, this information was used 
to develop a probability distribution for the difference 
in prices d. This continuous probability distribution
showed a .90 probability that the price difference would
be $.0655 or less and a .50 probability that the price
difference would be $.035 or less. In addition, there was
only a .10 probability that the price difference would be
$.0045 or less.¹

*The authors are indebted to Joel Kahn of Procter & Gamble for
providing this Statistics in Practice.

[Photo: Procter & Gamble is a leader in the application
of statistical methods in decision making. © John
Sommers II/Reuters]

The Industrial Chemicals Division thought that 
being able to quantify the impact of raw material 
price differences was key to reaching a consensus. 
The probabilities obtained were used in a sensitivity 
analysis of the raw material price difference. The 
analysis yielded sufficient insight to form the basis for a 
recommendation to management. 

The use of continuous random variables and 
their probability distributions was helpful to P&G in 
analyzing the economic risks associated with its fatty- 
alcohol production. In this chapter, you will gain an 
understanding of continuous random variables and 
their probability distributions, including one of the 
most important probability distributions in statistics, the 
normal distribution. 


¹The price differences stated here have been modified to protect
proprietary data.


In the preceding chapter we discussed discrete random variables and their probability
distributions. In this chapter we turn to the study of continuous random variables. Specifically,
we discuss three continuous probability distributions: the uniform, the normal, and
the exponential.
A fundamental difference separates discrete and continuous random variables in terms of
how probabilities are computed. For a discrete random variable, the probability function f(x) 
provides the probability that the random variable assumes a particular value. With continu- 
ous random variables, the counterpart of the probability function is the probability density 
function, also denoted by f(x). The difference is that the probability density function does 
not directly provide probabilities. However, the area under the graph of f(x) corresponding 
to a given interval does provide the probability that the continuous random variable x 
assumes a value in that interval. So when we compute probabilities for continuous random 
variables, we are computing the probability that the random variable assumes any value in 
an interval. 

Because the area under the graph of f(x) at any particular point is zero, one of the 
implications of the definition of probability for continuous random variables is that the 
probability of any particular value of the random variable is zero. In Section 6.1 we demon- 
strate these concepts for a continuous random variable that has a uniform distribution. 

Much of the chapter is devoted to describing and showing applications of the normal 
distribution. The normal distribution is of major importance because of its wide applica- 
bility and its extensive use in statistical inference. The chapter closes with a discussion of 
the exponential distribution. The exponential distribution is useful in applications involving 
such factors as waiting times and service times. 


6.1 Uniform Probability Distribution 


Consider the random variable x representing the flight time of an airplane traveling from 
Chicago to New York. Suppose the flight time can be any value in the interval from 
120 minutes to 140 minutes. Because the random variable x can assume any value in that 
interval, x is a continuous rather than a discrete random variable. Let us assume that sufficient 
actual flight data are available to conclude that the probability of a flight time within any 
1-minute interval is the same as the probability of a flight time within any other 1-minute 
interval contained in the larger interval from 120 to 140 minutes. With every 1-minute 
interval being equally likely, the random variable x is said to have a uniform probability 
distribution. The probability density function, which defines the uniform distribution for 
the flight-time random variable, is 


$$f(x) = \begin{cases} 1/20 & \text{for } 120 \le x \le 140 \\ 0 & \text{elsewhere} \end{cases}$$


Figure 6.1 is a graph of this probability density function. In general, the uniform proba- 
bility density function for a random variable x is defined by the following formula. 


UNIFORM PROBABILITY DENSITY FUNCTION
$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases} \qquad (6.1)$$


For the flight-time random variable, a = 120 and b = 140. 
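Equation (6.1) translates directly into a small function. The sketch below is an illustration in Python rather than the Excel approach used in this text, and the name `uniform_pdf` is an assumption made for the example.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Equation (6.1): height of the uniform density at x for endpoints a < b."""
    if a <= x <= b:
        return 1.0 / (b - a)
    return 0.0


if __name__ == "__main__":
    # Flight-time example: a = 120 and b = 140, so the height is 1/20 inside the interval.
    print(uniform_pdf(125, 120, 140))
    # Outside the interval the density is 0.
    print(uniform_pdf(150, 120, 140))
```

Note that the returned value is a height of the density function, not a probability; probabilities come from areas under this function.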

As noted in the introduction, for a continuous random variable, we consider proba- 
bility only in terms of the likelihood that a random variable assumes a value within a 
specified interval. In the flight time example, an acceptable probability question is: What 
is the probability that the flight time is between 120 and 130 minutes? That is, what is
P(120 ≤ x ≤ 130)? Because the flight time must be between 120 and 140 minutes and because
the probability is described as being uniform over this interval, we feel comfortable
saying that P(120 ≤ x ≤ 130) = .50. In the following subsection we show that this probability
can be computed as the area under the graph of f(x) from 120 to 130 (see Figure 6.2).
FIGURE 6.1 Uniform Probability Distribution for Flight Time
[Graph of f(x) versus flight time in minutes: constant height 1/20 from 120 to 140.]

FIGURE 6.2 Area Provides Probability of a Flight Time Between 120 and 130 Minutes
[Graph of f(x) versus flight time in minutes, with the region from 120 to 130 marked: P(120 ≤ x ≤ 130) = Area = 1/20(10) = 10/20 = .50]


Area as a Measure of Probability 


Let us make an observation about the graph in Figure 6.2. Consider the area under the
graph of f(x) in the interval from 120 to 130. The area is rectangular, and the area of a
rectangle is simply the width multiplied by the height. With the width of the interval equal
to 130 − 120 = 10 and the height equal to the value of the probability density function
f(x) = 1/20, we have area = width × height = 10(1/20) = 10/20 = .50.


What observation can you make about the area under the graph of f(x) and probability?
They are identical! Indeed, this observation is valid for all continuous random variables.
Once a probability density function f(x) is identified, the probability that x takes a value
between some lower value x₁ and some higher value x₂ can be found by computing the area
under the graph of f(x) over the interval from x₁ to x₂.

Given the uniform distribution for flight time and using the interpretation of area as
probability, we can answer any number of probability questions about flight times. For
example, what is the probability of a flight time between 128 and 136 minutes? The width
of the interval is 136 − 128 = 8. With the uniform height of f(x) = 1/20, we see that
P(128 ≤ x ≤ 136) = 8(1/20) = .40.
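The width-times-height calculation can be sketched as a short Python function. This is an illustration only; the name `uniform_prob` and the clipping of the requested interval to [a, b] are assumptions added for the example, since no area lies outside the range of the random variable.

```python
def uniform_prob(x1: float, x2: float, a: float, b: float) -> float:
    """P(x1 <= x <= x2) for a uniform distribution on [a, b], computed as width x height."""
    lower = max(x1, a)               # no area lies below a
    upper = min(x2, b)               # no area lies above b
    width = max(upper - lower, 0.0)  # an empty overlap has zero width
    height = 1.0 / (b - a)
    return width * height


if __name__ == "__main__":
    # The two flight-time questions from the text, with a = 120 and b = 140.
    print(uniform_prob(120, 130, 120, 140))
    print(uniform_prob(128, 136, 120, 140))
```

Because the area of a rectangle with zero width is zero, calling the function with x1 equal to x2 returns 0, whichever point is chosen.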

Note that P(120 ≤ x ≤ 140) = 20(1/20) = 1; that is, the total area under the graph of
f(x) is equal to 1. This property holds for all continuous probability distributions and is
the analog of the condition that the sum of the probabilities must equal 1 for a discrete
probability function. For a continuous probability density function, we must also require
that f(x) ≥ 0 for all values of x. This requirement is the analog of the requirement that
f(x) ≥ 0 for discrete probability functions.

To see that the probability of any single point is 0, refer to Figure 6.2 and
compute the probability of a single point, say, x = 125. P(x = 125) =
P(125 ≤ x ≤ 125) = 0(1/20) = 0.

Two major differences stand out between the treatment of continuous random variables 
and the treatment of their discrete counterparts. 


1. We no longer talk about the probability of the random variable assuming a particu- 
lar value. Instead, we talk about the probability of the random variable assuming a 
value within some given interval. 

2. The probability of a continuous random variable assuming a value within some
given interval from x₁ to x₂ is defined to be the area under the graph of the probability
density function between x₁ and x₂. Because a single point is an interval of zero
width, this implies that the probability of a continuous random variable assuming
any particular value exactly is zero. It also means that the probability of a continuous
random variable assuming a value in any interval is the same whether or
not the endpoints are included.


The calculation of the expected value and variance for a continuous random variable is 
analogous to that for a discrete random variable. However, because the computational 
procedure involves integral calculus, we leave the derivation of the appropriate formulas to 
more advanced texts. 

For the uniform continuous probability distribution introduced in this section, the for- 
mulas for the expected value and variance are 


$$E(x) = \frac{a + b}{2}$$

$$\mathrm{Var}(x) = \frac{(b - a)^2}{12}$$


In these formulas, a is the smallest value and b is the largest value that the random variable 
may assume. 

Applying these formulas to the uniform distribution for flight times from Chicago to 
New York, we obtain 


$$E(x) = \frac{120 + 140}{2} = 130$$

$$\mathrm{Var}(x) = \frac{(140 - 120)^2}{12} = 33.33$$


The standard deviation of flight times can be found by taking the square root of the
variance. Thus, σ = 5.77 minutes.
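
As a quick check of these formulas, the following Python sketch computes the expected value, variance, and standard deviation for the flight-time distribution. The helper names `uniform_mean` and `uniform_variance` are ours, chosen for illustration.

```python
import math


def uniform_mean(a, b):
    """Expected value of a uniform distribution on [a, b]."""
    return (a + b) / 2


def uniform_variance(a, b):
    """Variance of a uniform distribution on [a, b]."""
    return (b - a) ** 2 / 12


a, b = 120, 140                       # smallest and largest flight times, in minutes
mean = uniform_mean(a, b)
variance = uniform_variance(a, b)
std_dev = math.sqrt(variance)         # standard deviation is the square root of the variance

print(mean, round(variance, 2), round(std_dev, 2))
```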


NOTES + COMMENTS 


To see more clearly why the height of a probability density function is not a probability,
think about a random variable with the following uniform probability distribution.

$$f(x) = \begin{cases} 2 & \text{for } 0 \le x \le .5 \\ 0 & \text{elsewhere} \end{cases}$$

The height of the probability density function, f(x), is 2 for values of x between 0 and .5.
However, we know probabilities can never be greater than 1. Thus, we see that f(x) cannot be
interpreted as the probability of x.
EXERCISES


Methods 

1. The random variable x is known to be uniformly distributed between 1.0 and 1.5. 
a. Show the graph of the probability density function. 
b. Compute P(x = 1.25). 
c. Compute P(1.0 ≤ x ≤ 1.25).
d. Compute P(1.20 < x < 1.5). 

2. The random variable x is known to be uniformly distributed between 10 and 20. 
a. Show the graph of the probability density function. 
b. Compute P(x < 15). 
c. Compute P(12 ≤ x ≤ 18).
d. Compute E(x). 
e. Compute Var(x). 


Applications 

3. Cincinnati to Tampa Flight Time. Delta Airlines quotes a flight time of 2 hours, 5 
minutes for its flights from Cincinnati to Tampa. Suppose we believe that actual flight 
times are uniformly distributed between 2 hours and 2 hours, 20 minutes. 

a. Show the graph of the probability density function for flight time. 

b. What is the probability that the flight will be no more than 5 minutes late? 
c. What is the probability that the flight will be more than 10 minutes late? 
d. What is the expected flight time? 

4. Excel RAND Function. Most computer languages include a function that can be used 
to generate random numbers. In Excel, the RAND function can be used to generate 
random numbers between 0 and 1. If we let x denote a random number generated 
using RAND, then x is a continuous random variable with the following probability 
density function. 


$$f(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$

a. Graph the probability density function.

b. What is the probability of generating a random number between .25 and .75? 

c. What is the probability of generating a random number with a value less than or 
equal to .30? 

d. What is the probability of generating a random number with a value greater than 
.60? 

e. Generate 50 random numbers by entering =RAND() into 50 cells of an Excel 
worksheet. 

f. Compute the mean and standard deviation for the random numbers in part (e). 

5. Tesla Battery Recharge Time. The electric-vehicle manufacturing company Tesla 
estimates that a driver who commutes 50 miles per day in a Model S will require a 
nightly charge time of around 1 hour and 45 minutes (105 minutes) to recharge the 
vehicle’s battery (Tesla company website). Assume that the actual recharging time 
required is uniformly distributed between 90 and 120 minutes. 

a. Give a mathematical expression for the probability density function of battery 
recharging time for this scenario. 
b. What is the probability that the recharge time will be less than 110 minutes? 
c. What is the probability that the recharge time required is at least 100 minutes? 
d. What is the probability that the recharge time required is between 95 and 
110 minutes? 

6. Daily Discretionary Spending. A Gallup Daily Tracking Survey found that the 
mean daily discretionary spending by Americans earning over $90,000 per year 
was $136 per day. The discretionary spending excluded home purchases, vehicle 
purchases, and regular monthly bills. Let x = the discretionary spending per day
and assume that a uniform probability density function applies with f(x) = .00625
for a ≤ x ≤ b.

a. Find the values of a and b for the probability density function.

b. What is the probability that consumers in this group have daily discretionary
spending between $100 and $200?

c. What is the probability that consumers in this group have daily discretionary
spending of $150 or more?

d. What is the probability that consumers in this group have daily discretionary
spending of $80 or less?

7. Bidding on Land. Suppose we are interested in bidding on a piece of land and we
know one other bidder is interested.¹ The seller announced that the highest bid in
excess of $10,000 will be accepted. Assume that the competitor’s bid x is a random
variable that is uniformly distributed between $10,000 and $15,000.

a. Suppose you bid $12,000. What is the probability that your bid will be accepted?

b. Suppose you bid $14,000. What is the probability that your bid will be accepted?

c. What amount should you bid to maximize the probability that you get the property?

d. Suppose you know someone who is willing to pay you $16,000 for the property.
Would you consider bidding less than the amount in part (c)? Why or why not?


6.2 Normal Probability Distribution


The most commonly used probability distribution for describing a continuous random
variable is the normal probability distribution. The normal distribution has been used
in a wide variety of practical applications in which the random variables are heights and
weights of people, test scores, scientific measurements, amounts of rainfall, and other
similar values. It is also widely used in statistical inference, which is the major topic of the
remainder of this book. In such applications, the normal distribution provides a description
of the likely results obtained through sampling.

Abraham de Moivre, a French mathematician, published The Doctrine of Chances in 1733.
He derived the normal distribution.


Normal Curve 


The form, or shape, of the normal distribution is illustrated by the bell-shaped normal 
curve in Figure 6.3. The probability density function that defines the bell-shaped curve of 
the normal distribution follows. 


FIGURE 6.3 Bell-Shaped Curve for the Normal Distribution

[Graph: bell-shaped normal curve, labeled with Standard Deviation σ]


¹This exercise is based on a problem suggested to us by Professor Roger Myerson of Northwestern University.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




The normal curve has two 
parameters, μ and σ. They
determine the location 
and shape of the normal 
distribution. 




NORMAL PROBABILITY DENSITY FUNCTION 


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}} \qquad (6.2)$$

where
μ = mean
σ = standard deviation
π = 3.14159
e = 2.71828
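
To make equation (6.2) concrete, here is a minimal Python sketch that evaluates the height of the normal curve at a point. The function name `normal_pdf` is our own; the comparison against `statistics.NormalDist` from the Python standard library is included only as a sanity check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, following equation (6.2)."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)


# Height at the mean of a normal curve with mu = 0 and sigma = 1
print(normal_pdf(0, 0, 1))

# The hand-written formula should agree with the standard library
print(normal_pdf(11, 10, 2), NormalDist(10, 2).pdf(11))
```

Keep in mind that these values are heights of the curve, not probabilities; probabilities come from areas under the curve.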


We make several observations about the characteristics of the normal distribution. 


1. The entire family of normal distributions is differentiated by two parameters: the
mean μ and the standard deviation σ.

2. The highest point on the normal curve is at the mean, which is also the median and
mode of the distribution.

3. The mean of the distribution can be any numerical value: negative, zero, or positive.
Three normal distributions with the same standard deviation but three different
means (−10, 0, and 20) are shown in Figure 6.4.

4. The normal distribution is symmetric, with the shape of the normal curve to the
left of the mean a mirror image of the shape of the normal curve to the right of
the mean. The tails of the normal curve extend to infinity in both directions and
theoretically never touch the horizontal axis. Because it is symmetric, the normal
distribution is not skewed; its skewness measure is zero.

5. The standard deviation determines how flat and wide the normal curve is. Larger
values of the standard deviation result in wider, flatter curves, showing more
variability in the data. Two normal distributions with the same mean but with different
standard deviations are shown in Figure 6.5.

6. Probabilities for the normal random variable are given by areas under the normal
curve. The total area under the curve for the normal distribution is 1. Because the
distribution is symmetric, the area under the curve to the left of the mean is .50 and
the area under the curve to the right of the mean is .50.

7. The percentages of values in some commonly used intervals are
a. 68.3% of the values of a normal random variable are within plus or minus one
standard deviation of its mean;
b. 95.4% of the values of a normal random variable are within plus or minus two
standard deviations of its mean;
c. 99.7% of the values of a normal random variable are within plus or minus three
standard deviations of its mean.

These percentages are the basis for the empirical rule introduced in Section 3.3.


Figure 6.6 shows properties (a), (b), and (c) graphically. 
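
The three percentages in property 7 can be reproduced numerically. The sketch below relies on a standard identity that the text does not state: for a normal random variable, the probability of falling within k standard deviations of the mean equals erf(k/√2), where erf is the error function available as `math.erf` in Python. The function name `within_k_std` is ours.

```python
import math


def within_k_std(k):
    """P(mu - k*sigma <= x <= mu + k*sigma) for any normal random variable."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{100 * within_k_std(k):.1f}%")
```

Because the result depends only on k, the same percentages apply to every normal distribution, whatever its mean and standard deviation.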


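FIGURE 6.4 Normal Distributions with Same Standard Deviation and Different Means

[Graph: three normal curves with the same standard deviation, centered at −10, 0, and 20]

FIGURE 6.5 Normal Distributions with Same Mean and Different Standard Deviations

[Graph: normal curves with the same mean and different standard deviations]

FIGURE 6.6 Areas Under the Curve for Any Normal Distribution

[Graph: normal curve showing the areas within one, two, and three standard deviations of the mean: 68.3%, 95.4%, and 99.7%]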


Standard Normal Probability Distribution 


A random variable that has a normal distribution with a mean of zero and a standard
deviation of one is said to have a standard normal probability distribution. The letter z is
commonly used to designate this particular normal random variable. Figure 6.7 is the graph
of the standard normal distribution. It has the same general appearance as other normal
distributions, but with the special properties of μ = 0 and σ = 1.

Because μ = 0 and σ = 1, the formula for the standard normal probability density
function is a simpler version of equation (6.2).


STANDARD NORMAL DENSITY FUNCTION

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^{2}}{2}}$$

FIGURE 6.7 The Standard Normal Distribution

[Graph: standard normal curve centered at z = 0 with σ = 1]


As with other continuous random variables, probability calculations with any normal
distribution are made by computing areas under the graph of the probability density
function. Thus, to find the probability that a normal random variable is within any specific
interval, we must compute the area under the normal curve over that interval.

For the normal probability density function, the height of the normal curve varies and
more advanced mathematics is required to compute the areas that represent probability.

For the standard normal distribution, areas under the normal curve have been computed
and are available in tables that can be used to compute probabilities. Such a table of areas
under the normal curve is included in the online supplemental materials for this text.

The three types of probabilities we need to compute include (1) the probability
that the standard normal random variable z will be less than or equal to a given value;
(2) the probability that z will be between two given values; and (3) the probability that
z will be greater than or equal to a given value. To see how the cumulative probability
table for the standard normal distribution can be used to compute these three types of
probabilities, let us consider some examples.


We start by showing how to compute the probability that z is less than or equal to 1.00;
that is, P(z ≤ 1.00). This cumulative probability is the area under the normal curve to the
left of z = 1.00 in Figure 6.8.

Because the standard normal random variable is continuous, P(z ≤ 1.00) = P(z < 1.00).

FIGURE 6.8 Cumulative Probability for Standard Normal Distribution Corresponding to P(z ≤ 1.00)

[Graph: standard normal curve with the area to the left of z = 1.00 shaded; P(z ≤ 1.00) = .8413]

Refer to the standard normal probability table from the online supplemental materials
for this text. The cumulative probability corresponding to z = 1.00 is the table value
located at the intersection of the row labeled 1.0 and the column labeled .00. First we find
1.0 in the left column of the table and then find .00 in the top row of the table. By looking
in the body of the table, we find that the 1.0 row and the .00 column intersect at the value
of .8413; thus, P(z ≤ 1.00) = .8413. The following excerpt from the standard normal
probability table shows these steps.

  z      .00      .01      .02
  .9    .8159    .8186    .8212
 1.0    .8413    .8438    .8461
 1.1    .8643    .8665    .8686
 1.2    .8849    .8869    .8888

The entry in the 1.0 row and the .00 column is P(z ≤ 1.00) = .8413.
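
The table excerpt can be regenerated with software instead of being looked up. The following Python sketch uses `statistics.NormalDist` from the standard library; the helper name `table_value` and the row-plus-column lookup scheme are our own way of mimicking the printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()             # mean 0, standard deviation 1


def table_value(row, col):
    """Cumulative probability for z = row + col, rounded to four places like the table."""
    return round(std_normal.cdf(row + col), 4)


columns = (0.00, 0.01, 0.02)
print("  z  ", *(f"{col:.2f}  " for col in columns))
for row in (0.9, 1.0, 1.1, 1.2):
    cells = [f"{table_value(row, col):.4f}" for col in columns]
    print(f"{row:4.1f} ", *cells)
```

Here `table_value(1.0, 0.00)` plays the role of finding the intersection of the 1.0 row and the .00 column.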


To illustrate the second type of probability calculation, we show how to compute the
probability that z is in the interval between −.50 and 1.25; that is, P(−.50 ≤ z ≤ 1.25).
Figure 6.9 shows this area, or probability.

Three steps are required to compute this probability. First, we find the area under the
normal curve to the left of z = 1.25. Second, we find the area under the normal curve to
the left of z = −.50. Finally, we subtract the area to the left of z = −.50 from the area to
the left of z = 1.25 to find P(−.50 ≤ z ≤ 1.25).

To find the area under the normal curve to the left of z = 1.25, we first locate the 1.2 row
in the standard normal probability table and then move across to the .05 column. Because
the table value in the 1.2 row and the .05 column is .8944, P(z ≤ 1.25) = .8944. Similarly,
to find the area under the curve to the left of z = −.50, we use the left-hand page of the
table to locate the table value in the −.5 row and the .00 column; with a table value of .3085,
P(z ≤ −.50) = .3085. Thus, P(−.50 ≤ z ≤ 1.25) = P(z ≤ 1.25) − P(z ≤ −.50) = .8944 −
.3085 = .5859.

FIGURE 6.9 Probability for Standard Normal Distribution Corresponding to P(−.50 ≤ z ≤ 1.25)

[Graph: standard normal curve with the area between z = −.50 and z = 1.25 shaded; P(−.50 ≤ z ≤ 1.25) = .8944 − .3085 = .5859; the area to the left of z = −.50 is .3085]

Let us consider another example of computing the probability that z is in the interval
between two given values. Often it is of interest to compute the probability that a normal
random variable assumes a value within a certain number of standard deviations of the mean.
Suppose we want to compute the probability that the standard normal random variable is
within one standard deviation of the mean; that is, P(−1.00 ≤ z ≤ 1.00). To compute this
probability we must find the area under the curve between −1.00 and 1.00. Earlier we found
that P(z ≤ 1.00) = .8413. Referring again to the standard normal probability table from the
online supplemental materials for this text, we find that the area under the curve to the left
of z = −1.00 is .1587, so P(z ≤ −1.00) = .1587. Therefore, P(−1.00 ≤ z ≤ 1.00) =
P(z ≤ 1.00) − P(z ≤ −1.00) = .8413 − .1587 = .6826. This probability is shown
graphically in Figure 6.10.

FIGURE 6.10 Probability for Standard Normal Distribution Corresponding to P(−1.00 ≤ z ≤ 1.00)

[Graph: standard normal curve with the area between z = −1.00 and z = 1.00 shaded; P(−1.00 ≤ z ≤ 1.00) = .8413 − .1587 = .6826; the area to the left of z = −1.00 is .1587]
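
The subtraction of two cumulative probabilities used in these examples can be sketched as a small Python function. The name `interval_probability` is ours; the cumulative probabilities come from `statistics.NormalDist` rather than from the printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()             # standard normal: mean 0, standard deviation 1


def interval_probability(z_low, z_high):
    """P(z_low <= z <= z_high): area left of z_high minus area left of z_low."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)


print(round(interval_probability(-0.50, 1.25), 4))
print(round(interval_probability(-1.00, 1.00), 4))
```

Software carries more decimal places than the four-place table, so the results can differ from the table-based answers of .5859 and .6826 in the last digit.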

To illustrate how to make the third type of probability computation, suppose we want
to compute the probability of obtaining a z value of at least 1.58; that is, P(z ≥ 1.58). The
value in the z = 1.5 row and the .08 column of the cumulative normal table is .9429; thus,
P(z < 1.58) = .9429. However, because the total area under the normal curve is 1,
P(z ≥ 1.58) = 1 − .9429 = .0571. This probability is shown in Figure 6.11.

In the preceding illustrations, we showed how to compute probabilities given speci- 
fied z values. In some situations, we are given a probability and are interested in work- 
ing backward to find the corresponding z value. Suppose we want to find a z value such 
that the probability of obtaining a larger z value is .10. Figure 6.12 shows this situation 
graphically. 

This problem is the inverse of those in the preceding examples. Previously, we spec- 
ified the z value of interest and then found the corresponding probability, or area. In this 
example, we are given the probability, or area, and asked to find the corresponding z value. 
To do so, we use the standard normal probability table somewhat differently. 

FIGURE 6.11 Probability for Standard Normal Distribution Corresponding to P(z ≥ 1.58)

[Graph: standard normal curve with the upper tail beyond z = 1.58 shaded; P(z < 1.58) = .9429; P(z ≥ 1.58) = 1.0000 − .9429 = .0571]

FIGURE 6.12 Finding a z Value Such That Probability of Obtaining a Larger z Value Is .10

[Graph: standard normal curve with an upper-tail area labeled Probability = .10; horizontal axis: z, from −2 to +2; the cutoff is labeled "What is this z value?"]

Given a probability, we can use the standard normal table in an inverse fashion to find
the corresponding z value.

Recall that the standard normal probability table gives the area under the curve to
the left of a particular z value. We have been given the information that the area in the
upper tail of the curve is .10. Hence, the area under the curve to the left of the unknown
z value must equal .9000. Scanning the body of the table, we find that .8997 is the
cumulative probability value closest to .9000. The section of the table providing this
result follows.

  z      .06      .07      .08      .09
 1.0    .8554    .8577    .8599    .8621
 1.1    .8770    .8790    .8810    .8830
 1.2    .8962    .8980    .8997    .9015
 1.3    .9131    .9147    .9162    .9177
 1.4    .9279    .9292    .9306    .9319

The cumulative probability value closest to .9000 is .8997, in the 1.2 row and the .08 column.


Reading the z value from the leftmost column and the top row of the table, we find 
that the corresponding z value is 1.28. Thus, an area of approximately .9000 (actually 
.8997) will be to the left of z = 1.28.² In terms of the question originally asked, there is an
approximately .10 probability of a z value larger than 1.28. 
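
Working backward from an area to a z value can also be sketched in Python. The first function below uses the inverse cumulative function `inv_cdf` of `statistics.NormalDist`; the second imitates scanning the body of a printed table. Both function names are ours, and the scan range of z = 0.00 to 3.09 in steps of .01 is an assumption about the table's layout, not something stated in the text.

```python
from statistics import NormalDist

std_normal = NormalDist()


def z_for_upper_tail(area):
    """z value with the given area to its right (lower tail area = 1 - area)."""
    return std_normal.inv_cdf(1 - area)


def closest_table_z(target):
    """Scan four-place table entries for the one closest to the target area."""
    candidates = [i / 100 for i in range(0, 310)]   # assumed table range: 0.00 to 3.09
    return min(candidates,
               key=lambda z: abs(round(std_normal.cdf(z), 4) - target))


print(closest_table_z(0.9000))            # table-style lookup
print(round(z_for_upper_tail(0.10), 3))   # direct inverse calculation
```

The table-style scan stops at two decimal places, while the direct calculation supplies the extra digit that interpolation would otherwise provide.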

The examples illustrate that the table of cumulative probabilities for the standard normal 
probability distribution can be used to find probabilities associated with values of the standard 
normal random variable z. Two types of questions can be asked. The first type of question 
specifies a value, or values, for z and asks us to use the table to determine the corresponding 
areas or probabilities. The second type of question provides an area, or probability, and asks 
us to use the table to determine the corresponding z value. Thus, we need to be flexible in 
using the standard normal probability table to answer the desired probability question. In 
most cases, sketching a graph of the standard normal probability distribution and shading the 
appropriate area will help to visualize the situation and aid in determining the correct answer. 


Computing Probabilities for Any Normal Probability Distribution 


The reason for discussing the standard normal distribution so extensively is that probabilities
for all normal distributions are computed by using the standard normal distribution. That is,
when we have a normal distribution with any mean μ and any standard deviation σ, we
answer probability questions about the distribution by first converting to the standard normal
distribution. Then we can use the standard normal probability table and the appropriate z
values to find the desired probabilities. The formula used to convert any normal random variable
x with mean μ and standard deviation σ to the standard normal random variable z follows.

²We could use interpolation in the body of the table to get a better approximation of the z value that corresponds to an
area of .9000. Doing so to provide one more decimal place of accuracy would yield a z value of 1.282. However, in most
practical situations, sufficient accuracy is obtained simply by using the table value closest to the desired probability. And
if additional accuracy is needed, one can use most any statistical software to calculate the z value. Later in this chapter
we show how to calculate z values using Excel.


CONVERTING TO THE STANDARD NORMAL RANDOM VARIABLE

$$z = \frac{x - \mu}{\sigma} \qquad (6.3)$$

The formula for the standard normal random variable is similar to the formula we
introduced in Chapter 3 for computing z-scores for a data set.

A value of x equal to its mean μ results in z = (μ − μ)/σ = 0. Thus, we see that a
value of x equal to its mean μ corresponds to z = 0. Now suppose that x is one standard
deviation above its mean; that is, x = μ + σ. Applying equation (6.3), we see that
the corresponding z value is z = [(μ + σ) − μ]/σ = σ/σ = 1. Thus, an x value that is
one standard deviation above its mean corresponds to z = 1. In other words, we can
interpret z as the number of standard deviations that the normal random variable x is
from its mean μ.

To see how this conversion enables us to compute probabilities for any normal distribution,
suppose we have a normal distribution with μ = 10 and σ = 2. What is the probability
that the random variable x is between 10 and 14? Using equation (6.3), we see that at x = 10,
z = (x − μ)/σ = (10 − 10)/2 = 0 and that at x = 14, z = (14 − 10)/2 = 4/2 = 2. Thus,
the answer to our question about the probability of x being between 10 and 14 is given by
the equivalent probability that z is between 0 and 2 for the standard normal distribution. In
other words, the probability that we are seeking is the probability that the random variable
x is between its mean and two standard deviations above the mean. Using z = 2.00 and the
standard normal probability table from the online supplemental materials for this text,
we see that P(z ≤ 2) = .9772. Because P(z ≤ 0) = .5000, we can compute P(.00 ≤ z ≤ 2.00) =
P(z ≤ 2) − P(z ≤ 0) = .9772 − .5000 = .4772. Hence the probability that x is between 10
and 14 is .4772.
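
The two-step procedure just described (convert x values to z values, then use the standard normal distribution) can be sketched as follows. The function names `z_value` and `normal_interval_probability` are ours, and `statistics.NormalDist` stands in for the printed table.

```python
from statistics import NormalDist


def z_value(x, mu, sigma):
    """Equation (6.3): number of standard deviations x lies from its mean."""
    return (x - mu) / sigma


def normal_interval_probability(x_low, x_high, mu, sigma):
    """P(x_low <= x <= x_high) found by converting to the standard normal."""
    std_normal = NormalDist()                       # mean 0, standard deviation 1
    z_low = z_value(x_low, mu, sigma)
    z_high = z_value(x_high, mu, sigma)
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)


print(z_value(10, 10, 2), z_value(14, 10, 2))
print(round(normal_interval_probability(10, 14, 10, 2), 4))
```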


Grear Tire Company Problem 


We turn now to an application of the normal probability distribution. Suppose the Grear 
Tire Company developed a new steel-belted radial tire to be sold through a national chain 
of discount stores. Because the tire is a new product, Grear’s managers believe that the 
mileage guarantee offered with the tire will be an important factor in the acceptance of the 
product. Before finalizing the tire mileage guarantee policy, Grear’s managers want prob- 
ability information about the number of miles the tires will last. Let x denote the number of 
miles the tire lasts. 

From actual road tests with the tires, Grear’s engineering group estimated that the mean
tire mileage is μ = 36,500 miles and that the standard deviation is σ = 5,000. In addition,
the data collected indicate that a normal distribution is a reasonable assumption. What
percentage of the tires can be expected to last more than 40,000 miles? In other words, what is
the probability that the tire mileage, x, will exceed 40,000? This question can be answered
by finding the area of the darkly shaded region in Figure 6.13.

At x = 40,000, we have

$$z = \frac{x - \mu}{\sigma} = \frac{40{,}000 - 36{,}500}{5{,}000} = \frac{3{,}500}{5{,}000} = .70$$


Refer now to the bottom of Figure 6.13. We see that a value of x = 40,000 on the Grear
Tire normal distribution corresponds to a value of z = .70 on the standard normal distribution.
Using the standard normal probability table, we see that the area under the standard
normal curve to the left of z = .70 is .7580. Thus, 1.000 − .7580 = .2420 is the probability
that z will exceed .70 and hence x will exceed 40,000. We can conclude that about 24.2%
of the tires will exceed 40,000 in mileage.

FIGURE 6.13 Grear Tire Company Mileage Distribution

[Graph: normal curve with μ = 36,500 and σ = 5,000; the area to the left of x = 40,000 is P(x < 40,000) and the shaded upper tail is P(x ≥ 40,000) = ?; a second axis shows z. Note: z = 0 corresponds to x = μ = 36,500. Note: z = .70 corresponds to x = 40,000.]

Let us now assume that Grear is considering a guarantee that will provide a discount on
replacement tires if the original tires do not provide the guaranteed mileage. What should
the guaranteed mileage be if Grear wants no more than 10% of the tires to be eligible for
the discount guarantee? This question is interpreted graphically in Figure 6.14.

According to Figure 6.14, the area under the curve to the left of the unknown guaranteed
mileage must be .10. So, we must first find the z value that cuts off an area of .10 in the left
tail of a standard normal distribution. Using the standard normal probability table, we see that
z = −1.28 cuts off an area of .10 in the lower tail. Hence, z = −1.28 is the value of the standard
normal random variable corresponding to the desired mileage guarantee on the Grear
Tire normal distribution. To find the value of x corresponding to z = −1.28, we have

$$z = \frac{x - \mu}{\sigma} = -1.28$$

$$x - \mu = -1.28\sigma$$

$$x = \mu - 1.28\sigma$$

The guaranteed mileage we need to find is 1.28 standard deviations below the mean.
Thus, x = μ − 1.28σ.


With μ = 36,500 and σ = 5,000,
x = 36,500 − 1.28(5,000) = 30,100

Thus, a guarantee of 30,100 miles will meet the requirement that approximately 10% of
the tires will be eligible for the guarantee. Perhaps, with this information, the firm will set
its tire mileage guarantee at 30,000 miles.

With the guarantee set at 30,000 miles, the actual percentage eligible for the guarantee
will be 9.68%.

FIGURE 6.14 Grear’s Discount Guarantee

[Graph: normal curve with μ = 36,500 and σ = 5,000; the lower tail to the left of the unknown guarantee mileage is labeled "10% of tires eligible for discount guarantee"; Guarantee mileage = ?]

The letter S that appears in the name of the NORM.S.DIST and NORM.S.INV functions
reminds us that these functions relate to the standard normal probability distribution.

The probabilities in cells D4, .5858, and D5, .6827, differ from what we computed earlier
due to rounding.

Again, we see the important role that probability distributions play in providing 
decision-making information. Namely, once a probability distribution is established for a 
particular application, it can be used to obtain probability information about the problem. 
Probability does not make a decision recommendation directly, but it provides information 
that helps the decision maker better understand the risks and uncertainties associated with 
the problem. Ultimately, this information may assist the decision maker in reaching a good 
decision. 
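
The two Grear Tire calculations can be checked with a short Python sketch. It assumes, as the text does, that mileage is normally distributed with μ = 36,500 and σ = 5,000; the variable names are ours, and `statistics.NormalDist` replaces the table lookups.

```python
from statistics import NormalDist

MU, SIGMA = 36_500, 5_000                 # mean and standard deviation from the road tests
mileage = NormalDist(MU, SIGMA)

# Probability that a tire lasts more than 40,000 miles (upper tail area)
p_exceeds_40000 = 1 - mileage.cdf(40_000)

# Guarantee mileage with 10% of tires eligible, using the table value z = -1.28
guarantee_table_z = MU - 1.28 * SIGMA

# The same cutoff without rounding z to two decimal places
guarantee_exact = mileage.inv_cdf(0.10)

# Share of tires eligible if the guarantee is set at 30,000 miles
pct_eligible_at_30000 = mileage.cdf(30_000)

print(round(p_exceeds_40000, 4))
print(guarantee_table_z, round(guarantee_exact))
print(round(100 * pct_eligible_at_30000, 2))
```

The unrounded cutoff comes out slightly below 30,100 because the table value −1.28 is itself a rounded z value.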


Using Excel to Compute Normal Probabilities 


Excel provides two functions for computing probabilities and z values for a standard
normal probability distribution: NORM.S.DIST and NORM.S.INV. The NORM.S.DIST
function computes the cumulative probability given a z value, and the NORM.S.INV
function computes the z value given a cumulative probability. Two similar functions,
NORM.DIST and NORM.INV, are available for computing the cumulative probability
and the x value for any normal distribution. We begin by showing how to use the
NORM.S.DIST and NORM.S.INV functions.

The NORM.S.DIST function provides the area under the standard normal curve to 
the left of a given z value; thus, it provides the same cumulative probability we would 
obtain if we used the standard normal probability table from the online supplemental 
materials for this text. Using the NORM.S.DIST function is just like having Excel look up 
cumulative normal probabilities for you. The NORM.S.INV function is the inverse of the 
NORM.S.DIST function; it takes a cumulative probability as input and provides the z value 
corresponding to that cumulative probability. 

Let’s see how both of these functions work by computing the probabilities and z values 
obtained earlier in this section using the standard normal probability table. Refer to 
Figure 6.15 as we describe the tasks involved. The formula worksheet is in the back- 
ground; the value worksheet is in the foreground. 


Enter/Access Data: Open a blank worksheet. No data are entered in the worksheet. We will 
simply enter the appropriate z values and probabilities directly into the formulas as needed. 


Enter Functions and Formulas: The NORM.S.DIST function has two inputs: the 
z value and a value of TRUE or FALSE. For the second input we enter TRUE if a cu- 
mulative probability is desired, and we enter FALSE if the height of the standard normal 
curve is desired. Because we will always be using NORM.S.DIST to compute cumulative 
probabilities, we always choose TRUE for the second input. To illustrate the use of the 
NORM.S.DIST function, we compute the four probabilities shown in cells D3:D6 of 
Figure 6.15. 


To compute the cumulative probability to the left of a given z value (area in lower tail),
we simply evaluate NORM.S.DIST at the z value. For instance, to compute P(z ≤ 1) we
entered the formula =NORM.S.DIST(1, TRUE) in cell D3. The result, .8413, is the same as
obtained using the standard normal probability table.

To compute the probability of z being in an interval we compute the value of NORM.S.DIST
at the upper endpoint of the interval and subtract the value of NORM.S.DIST at the lower
endpoint of the interval. For instance, to find P(−.50 ≤ z ≤ 1.25), we entered the formula
=NORM.S.DIST(1.25, TRUE)-NORM.S.DIST(-.50, TRUE) in cell D4. The interval probability
in cell D5 is computed in a similar fashion.
FIGURE 6.15 Excel Worksheet for Computing Probabilities and z Values
for the Standard Normal Distribution

Probabilities: Standard Normal Distribution

Cell   Probability               Formula                                            Value
D3     P(z <= 1)                 =NORM.S.DIST(1,TRUE)                               0.8413
D4     P(-.50 <= z <= 1.25)      =NORM.S.DIST(1.25,TRUE)-NORM.S.DIST(-0.5,TRUE)     0.5858
D5     P(-1.00 <= z <= 1.00)     =NORM.S.DIST(1,TRUE)-NORM.S.DIST(-1,TRUE)          0.6827
D6     P(z >= 1.58)              =1-NORM.S.DIST(1.58,TRUE)                          0.0571

Finding z Values Given Probabilities

Cell   Description                       Formula                Value
D11    z value with .10 in upper tail    =NORM.S.INV(0.9)       1.28
D12    z value with .025 in upper tail   =NORM.S.INV(0.975)     1.96
D13    z value with .025 in lower tail   =NORM.S.INV(0.025)     -1.96


To compute the probability to the right of a given z value (upper tail area), we must 
subtract the cumulative probability represented by the area under the curve below the 
z value (lower tail area) from 1. For example, to compute P(z ≥ 1.58) we entered the 
formula =1-NORM.S.DIST(1.58,TRUE) in cell D6. 

To compute the z value for a given cumulative probability (lower tail area), we use 
the NORM.S.INV function. To find the z value corresponding to an upper tail proba- 
bility of .10, we note that the corresponding lower tail area is .90 and enter the formula 
=NORM.S.INV(0.9) in cell D11. Actually, NORM.S.INV(0.9) gives us the z value providing 
a cumulative probability (lower tail area) of .9. But it is also the z value associated with an 
upper tail area of 1 - .9 = .10. 

Two other z values are computed in Figure 6.15. These z values will be used exten- 
sively in succeeding chapters. To compute the z value corresponding to an upper tail 
probability of .025, we entered the formula =NORM.S.INV(0.975) in cell D12. To com- 
pute the z value corresponding to a lower tail probability of .025, we entered the formula 
=NORM.S.INV(0.025) in cell D13. We see that z = 1.96 corresponds to an upper tail 
probability of .025, and z = -1.96 corresponds to a lower tail probability of .025. 
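
For readers who want to check these worksheet results outside Excel, the following is a minimal sketch in Python. It assumes the standard library class statistics.NormalDist as a stand-in for NORM.S.DIST (through its cdf method) and NORM.S.INV (through its inv_cdf method). The variable names are ours and are not part of the worksheet.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Cumulative (lower tail) probabilities, analogous to NORM.S.DIST(z, TRUE).
p_le_1 = z.cdf(1.0)                       # P(z <= 1)
p_interval = z.cdf(1.25) - z.cdf(-0.50)   # P(-.50 <= z <= 1.25)
p_within_1 = z.cdf(1.0) - z.cdf(-1.0)     # P(-1.00 <= z <= 1.00)
p_ge_158 = 1.0 - z.cdf(1.58)              # P(z >= 1.58), upper tail

# z values for a given lower tail area, analogous to NORM.S.INV(p).
z_upper_10 = z.inv_cdf(0.90)     # .10 in the upper tail
z_upper_025 = z.inv_cdf(0.975)   # .025 in the upper tail
z_lower_025 = z.inv_cdf(0.025)   # .025 in the lower tail

print(f"{p_le_1:.4f} {p_interval:.4f} {p_within_1:.4f} {p_ge_158:.4f}")
print(f"{z_upper_10:.2f} {z_upper_025:.2f} {z_lower_025:.2f}")
```

Rounded to the precision used in Figure 6.15, the printed values should agree with cells D3:D6 and D11:D13.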

Let us now turn to the Excel functions for computing cumulative probabilities and x values 
for any normal distribution. The NORM.DIST function provides the area under the normal 
curve to the left of a given value of the random variable x; thus it provides cumulative 
probabilities. The NORM.INV function is the inverse of the NORM.DIST function; it takes a 
cumulative probability as input and provides the value of x corresponding to that cumulative 
probability. The NORM.DIST and NORM.INV functions do the same thing for any normal 
distribution that the NORM.S.DIST and NORM.S.INV functions do for the standard normal 
distribution. 

Let’s see how both of these functions work by computing probabilities and x values for 
the Grear Tire Company example introduced earlier in this section. Recall that the lifetime 
of a Grear tire has a mean of 36,500 miles and a standard deviation of 5000 miles. Refer to 
Figure 6.16 as we describe the tasks involved. The formula worksheet is in the background; 
the value worksheet is in the foreground. 


Enter/Access Data: Open a blank worksheet. No data are entered in the worksheet. 
We simply enter the appropriate x values and probabilities directly into the formulas as 
needed. 

FIGURE 6.16  Excel Worksheet for Computing Probabilities and x Values 
for the Normal Distribution 

Probabilities: Normal Distribution 

Cell   Description                       Formula                                                                  Value
D3     P(x <= 20000)                     =NORM.DIST(20000,36500,5000,TRUE)                                        0.0005
D4     P(20000 <= x <= 40000)            =NORM.DIST(40000,36500,5000,TRUE)-NORM.DIST(20000,36500,5000,TRUE)       0.7576
D5     P(x >= 40000)                     =1-NORM.DIST(40000,36500,5000,TRUE)                                      0.2420

Finding x Values Given Probabilities 

Cell   Description                       Formula                                                                  Value
D9     x value with .10 in lower tail    =NORM.INV(0.1,36500,5000)                                                30092.24
D10    x value with .025 in upper tail   =NORM.INV(0.975,36500,5000)                                              46299.82



Enter Functions and Formulas: The NORM.DIST function has four inputs: (1) the 
x value we want to compute the cumulative probability for, (2) the mean, (3) the standard 
deviation, and (4) a value of TRUE or FALSE. For the fourth input, we enter TRUE if a 
cumulative probability is desired, and we enter FALSE if the height of the curve is desired. 
Because we will always be using NORM.DIST to compute cumulative probabilities, we will 
always choose TRUE for the fourth input. 


To compute the cumulative probability to the left of a given x value (lower tail 
area), we simply evaluate NORM.DIST at the x value. For instance, to compute the 
probability that a Grear tire will last 20,000 miles or less, we entered the formula 
=NORM.DIST(20000,36500,5000, TRUE) in cell D3. The value worksheet shows that 
this cumulative probability is .0005. So, we can conclude that almost all Grear tires 
will last at least 20,000 miles. 

To compute the probability of x being in an interval we compute the value of 
NORM.DIST at the upper endpoint of the interval and subtract the value of NORM.DIST 
at the lower endpoint of the interval. The formula in cell D4 provides the probability that a 
tire’s lifetime is between 20,000 and 40,000 miles, P(20,000 ≤ x ≤ 40,000). In the value 
worksheet, we see that this probability is .7576. 

To compute the probability to the right of a given x value (upper tail area), we must sub- 
tract the cumulative probability represented by the area under the curve below the x value 
(lower tail area) from 1. The formula in cell D5 computes the probability that a Grear tire 
will last for at least 40,000 miles. We see that this probability is .2420. 

To compute the x value for a given cumulative probability, we use the NORM.INV 
function. The NORM.INV function has only three inputs. The first input is the cumulative 
probability; the second and third inputs are the mean and standard deviation. For instance, 
to compute the tire mileage corresponding to a lower tail area of .1 for Grear Tire, we enter 
the formula =NORM.INV(0.1,36500,5000) in cell D9. From the value worksheet, we see 
that 10% of the Grear tires will last for 30,092.24 miles or less. 

To compute the minimum tire mileage for the top 2.5% of Grear tires, we want to find 
the value of x corresponding to an area of .025 in the upper tail. This calculation is the same 
as finding the x value that provides a cumulative probability of .975. Thus we entered the 
formula =NORM.INV(0.975,36500,5000) in cell D10 to compute this tire mileage. From 
the value worksheet, we see that 2.5% of the Grear tires will last at least 46,299.82 miles. 
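
The Grear Tire calculations can also be reproduced outside Excel. The sketch below assumes Python's standard library class statistics.NormalDist as a stand-in for NORM.DIST and NORM.INV; the variable names are illustrative only. The last lines also show that standardizing x with z = (x - μ)/σ and using the standard normal distribution gives the same cumulative probability.

```python
from statistics import NormalDist

# Grear tire lifetime: mean 36,500 miles, standard deviation 5,000 miles.
tire_life = NormalDist(mu=36500, sigma=5000)

# Cumulative probabilities, analogous to NORM.DIST(x, mean, sd, TRUE).
p_le_20000 = tire_life.cdf(20000)                         # P(x <= 20,000)
p_between = tire_life.cdf(40000) - tire_life.cdf(20000)   # P(20,000 <= x <= 40,000)
p_ge_40000 = 1 - tire_life.cdf(40000)                     # P(x >= 40,000)

# x values for a given lower tail area, analogous to NORM.INV(p, mean, sd).
x_lower_10 = tire_life.inv_cdf(0.10)     # .10 in the lower tail
x_upper_025 = tire_life.inv_cdf(0.975)   # .025 in the upper tail

# Standardizing first gives the same cumulative probability.
z_40000 = (40000 - 36500) / 5000
p_le_40000_via_z = NormalDist().cdf(z_40000)

print(f"{p_le_20000:.4f} {p_between:.4f} {p_ge_40000:.4f}")
print(f"{x_lower_10:.2f} {x_upper_025:.2f}")
print(f"{p_le_40000_via_z:.4f} {tire_life.cdf(40000):.4f}")
```

Rounded as in Figure 6.16, the first two printed lines should match cells D3:D5 and D9:D10.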

EXERCISES 


Methods 

8. Using Figure 6.6 as a guide, sketch a normal curve for a random variable x that has a 
mean of μ = 100 and a standard deviation of σ = 10. Label the horizontal axis with 
values of 70, 80, 90, 100, 110, 120, and 130. 

9. A random variable is normally distributed with a mean of μ = 50 and a standard 
deviation of σ = 5. 

a. Sketch a normal curve for the probability density function. Label the horizontal 
axis with values of 35, 40, 45, 50, 55, 60, and 65. Figure 6.6 shows that the normal 
curve almost touches the horizontal axis at three standard deviations below and at 
three standard deviations above the mean (in this case at 35 and 65). 

b. What is the probability that the random variable will assume a value between 45 
and 55? 

c. What is the probability that the random variable will assume a value between 40 
and 60? 

10. Draw a graph for the standard normal distribution. Label the horizontal axis at values 
of -3, -2, -1, 0, 1, 2, and 3. Then compute the following probabilities. 

a. P(z ≤ 1.5) 
b. P(z ≤ 1) 
c. P(1 ≤ z ≤ 1.5) 
d. P(0 < z < 2.5) 

11. Given that z is a standard normal random variable, compute the following probabilities. 

a. P(z ≤ -1.0) 
b. P(z ≥ -1) 
c. P(z ≥ -1.5) 
d. P(-2.5 ≤ z) 
e. P(-3 < z ≤ 0) 

12. Given that z is a standard normal random variable, compute the following probabilities. 

a. P(0 ≤ z ≤ .83) 
b. P(-1.57 ≤ z ≤ 0) 
c. P(z > .44) 
d. P(z ≥ -.23) 
e. P(z < 1.20) 
f. P(z ≤ -.71) 

13. Given that z is a standard normal random variable, compute the following probabilities. 

a. P(-1.98 ≤ z ≤ .49) 
b. P(.52 ≤ z ≤ 1.22) 
c. P(-1.75 ≤ z ≤ -1.04) 

14. Given that z is a standard normal random variable, find z for each situation. 

a. The area to the left of z is .9750. 
b. The area between 0 and z is .4750. 
c. The area to the left of z is .7291. 
d. The area to the right of z is .1314. 
e. The area to the left of z is .6700. 
f. The area to the right of z is .3300. 

15. Given that z is a standard normal random variable, find z for each situation. 

a. The area to the left of z is .2119. 
b. The area between -z and z is .9030. 
c. The area between -z and z is .2052. 
d. The area to the left of z is .9948. 
e. The area to the right of z is .6915. 

16. Given that z is a standard normal random variable, find z for each situation. 

a. The area to the right of z is .01. 
b. The area to the right of z is .025. 
c. The area to the right of z is .05. 
d. The area to the right of z is .10. 


Applications 

17. Height of Dutch Men. Males in the Netherlands are the tallest, on average, in the 
world with an average height of 183 centimeters (cm) (BBC News website). Assume 
that the height of men in the Netherlands is normally distributed with a mean of 
183 cm and standard deviation of 10.5 cm. 

a. What is the probability that a Dutch male is shorter than 175 cm? 

b. What is the probability that a Dutch male is taller than 195 cm? 

c. What is the probability that a Dutch male is between 173 and 193 cm? 

d. Out of a random sample of 1000 Dutch men, how many would we expect to be taller 
than 190 cm? 

18. Large-Cap Domestic Stock Fund. The average return for large-cap domestic stock 
funds over three years was 14.4%. Assume the three-year returns were normally 
distributed across funds with a standard deviation of 4.4%. 

a. What is the probability an individual large-cap domestic stock fund had a three-year 
return of at least 20%? 

b. What is the probability an individual large-cap domestic stock fund had a three-year 
return of 10% or less? 

c. How big does the return have to be to put a domestic stock fund in the top 10% for 
the three-year period? 

19. Automobile Repair Costs. Automobile repair costs continue to rise with the average 
cost now at $367 per repair (U.S. News & World Report website). Assume that the 
cost for an automobile repair is normally distributed with a standard deviation of $88. 
Answer the following questions about the cost of automobile repairs. 

a. What is the probability that the cost will be more than $450? 

b. What is the probability that the cost will be less than $250? 

c. What is the probability that the cost will be between $250 and $450? 

d. If the cost for your car repair is in the lower 5% of automobile repair charges, what 
is your cost? 

20. Gasoline Prices. Suppose that the average price for a gallon of gasoline in the United 
States is $3.73 and in Russia it is $3.40. Assume these averages are the population 
means in the two countries and that the probability distributions are normally distrib- 
uted with a standard deviation of $.25 in the United States and a standard deviation of 
$.20 in Russia. 

a. What is the probability that a randomly selected gas station in the United States 
charges less than $3.50 per gallon? 

b. What percentage of the gas stations in Russia charge less than $3.50 per gallon? 

c. What is the probability that a randomly selected gas station in Russia charged more 
than the mean price in the United States? 

21. Mensa Membership. A person must score in the upper 2% of the population on an IQ 
test to qualify for membership in Mensa, the international high-IQ society. If IQ scores 
are normally distributed with a mean of 100 and a standard deviation of 15, what score 
must a person have to qualify for Mensa? 

22. Television Viewing. Suppose that the mean daily viewing time of television is 
8.35 hours. Use a normal probability distribution with a standard deviation of 
2.5 hours to answer the following questions about daily television viewing per 
household. 

a. What is the probability that a household views television between 5 and 10 hours a 
day? 

b. How many hours of television viewing must a household have in order to be in the 
top 3% of all television viewing households? 

c. What is the probability that a household views television more than 3 hours a day? 

23. Time to Complete Final Exam. The time needed to complete a final examination in 
a particular college course is normally distributed with a mean of 80 minutes and a 
standard deviation of 10 minutes. Answer the following questions. 

a. What is the probability of completing the exam in one hour or less? 

b. What is the probability that a student will complete the exam in more than 60 
minutes but less than 75 minutes? 

c. Assume that the class has 60 students and that the examination period is 90 minutes 
in length. How many students do you expect will be unable to complete the exam in 
the allotted time? 


24. Labor Day Travel Costs. The American Automobile Association (AAA) reported that 
families planning to travel over the Labor Day weekend would spend an average of $749. 
Assume that the amount spent is normally distributed with a standard deviation of $225. 


a. What is the probability of family expenses for the weekend being less than $400? 

b. What is the probability of family expenses for the weekend being $800 or more? 

c. What is the probability that family expenses for the weekend will be between $500 
and $1,000? 

d. What would the Labor Day weekend expenses have to be for the 5% of the families 
with the most expensive travel plans? 


25. Household Income in Maryland. According to Money magazine, Maryland had the 
highest median annual household income of any state in 2018 at $75,847 (Time.com 
website). Assume that annual household income in Maryland follows a normal distri- 
bution with a median of $75,847 and standard deviation of $33,800. 


a. What is the probability that a household in Maryland has an annual income of 
$100,000 or more? 

b. What is the probability that a household in Maryland has an annual income of 
$40,000 or less? 

c. What is the probability that a household in Maryland has an annual income between 
$50,000 and $70,000? 

d. What is the annual income of a household in the 90th percentile of annual 
household income in Maryland? 


6.3 Exponential Probability Distribution 


The exponential probability distribution may be used for random variables such 
as the time between arrivals at a car wash, the time required to load a truck, and the 
distance between major defects in a highway. The exponential probability density 
function follows. 


EXPONENTIAL PROBABILITY DENSITY FUNCTION 

$$f(x) = \frac{1}{\mu} e^{-x/\mu} \quad \text{for } x \ge 0 \tag{6.4}$$

where 

μ = expected value or mean 
e = 2.71828 


As an example of the exponential distribution, suppose that x represents the loading time 
for a truck at the Schips loading dock and follows such a distribution. If the mean, or average, 
loading time is 15 minutes (μ = 15), the appropriate probability density function for x is 

$$f(x) = \frac{1}{15} e^{-x/15}$$


Figure 6.17 is the graph of this probability density function. 

FIGURE 6.17  Exponential Distribution for the Schips Loading Dock Example 


Computing Probabilities for the Exponential Distribution 


As with any continuous probability distribution, the area under the curve corresponding 
to an interval provides the probability that the random variable assumes a value in that 
interval. In the Schips loading dock example, the probability that loading a truck will take 
6 minutes or less, P(x ≤ 6), is defined to be the area under the curve in Figure 6.17 from 
x = 0 to x = 6. Similarly, the probability that the loading time will be 18 minutes or less, 
P(x ≤ 18), is the area under the curve from x = 0 to x = 18. Note also that the probability 
that the loading time will be between 6 minutes and 18 minutes, P(6 ≤ x ≤ 18), is given by 
the area under the curve from x = 6 to x = 18. 

In waiting line applications, the exponential distribution is often used for service time. 

To compute exponential probabilities such as those just described, we use the following 
formula. It provides the cumulative probability of obtaining a value for the exponential 
random variable of less than or equal to some specific value denoted by x₀. 


EXPONENTIAL DISTRIBUTION: CUMULATIVE PROBABILITIES 

$$P(x \le x_0) = 1 - e^{-x_0/\mu} \tag{6.5}$$

For the Schips loading dock example, x = loading time in minutes and μ = 15 minutes. 
Using equation (6.5), 

$$P(x \le x_0) = 1 - e^{-x_0/15}$$

Hence, the probability that loading a truck will take 6 minutes or less is 

$$P(x \le 6) = 1 - e^{-6/15} = .3297$$

Using equation (6.5), we calculate the probability of loading a truck in 18 minutes or less. 

$$P(x \le 18) = 1 - e^{-18/15} = .6988$$


Thus, the probability that loading a truck will take between 6 minutes and 18 minutes is equal 
to .6988 - .3297 = .3691. Probabilities for any other interval can be computed similarly. 

In the preceding example, the mean time it takes to load a truck is μ = 15 minutes. A 
property of the exponential distribution is that the mean of the distribution and the standard 
deviation of the distribution are equal. Thus, the standard deviation for the time it takes to 
load a truck is σ = 15 minutes. The variance is σ² = (15)² = 225. 
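
Equation (6.5) is simple enough to evaluate directly. The following minimal sketch in Python uses only the standard math module. The function name expon_cdf and the variable names are our own and serve only to illustrate the Schips loading dock calculations.

```python
import math

def expon_cdf(x0, mean):
    """Cumulative probability P(x <= x0) = 1 - e^(-x0/mean), equation (6.5)."""
    if x0 < 0:
        return 0.0  # an exponential random variable is never negative
    return 1.0 - math.exp(-x0 / mean)

MEAN_LOADING_TIME = 15.0  # minutes, Schips loading dock example

p_le_6 = expon_cdf(6, MEAN_LOADING_TIME)     # P(x <= 6)
p_le_18 = expon_cdf(18, MEAN_LOADING_TIME)   # P(x <= 18)
p_6_to_18 = p_le_18 - p_le_6                 # P(6 <= x <= 18)

print(f"{p_le_6:.4f} {p_le_18:.4f} {p_6_to_18:.4f}")
```

Rounded to four decimal places, the three values should agree with .3297, .6988, and .3691 computed above.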

Relationship Between the Poisson and Exponential Distributions 

If arrivals follow a Poisson distribution, the time between arrivals must follow an 
exponential distribution. 


In Section 5.5 we introduced the Poisson distribution as a discrete probability distribution 
that is often useful in examining the number of occurrences of an event over a specified 
interval of time or space. Recall that the Poisson probability function is 


$$f(x) = \frac{\mu^x e^{-\mu}}{x!}$$

where 

μ = expected value or mean number of occurrences over a specified interval 

The continuous exponential probability distribution is related to the discrete Poisson dis- 
tribution. If the Poisson distribution provides an appropriate description of the number of 
occurrences per interval, the exponential distribution provides a description of the length of 
the interval between occurrences. 

To illustrate this relationship, suppose the number of cars that arrive at a car wash dur- 
ing one hour is described by a Poisson probability distribution with a mean of 10 cars per 
hour. The Poisson probability function that gives the probability of x arrivals per hour is 


$$f(x) = \frac{10^x e^{-10}}{x!}$$


Because the average number of arrivals is 10 cars per hour, the average time between cars 
arriving is 


$$\frac{1 \text{ hour}}{10 \text{ cars}} = .1 \text{ hour/car}$$


Thus, the corresponding exponential distribution that describes the time between the 
arrivals has a mean of μ = .1 hour per car; as a result, the appropriate exponential 
probability density function is 

$$f(x) = \frac{1}{.1} e^{-x/.1} = 10e^{-10x}$$
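
The link between the two distributions can be illustrated with a small simulation. The sketch below is our own illustration in Python, not part of the car wash example. It draws exponential times between arrivals with a mean of .1 hour using random.expovariate, then counts the arrivals that fall in each hour. The seed and the number of simulated hours are arbitrary assumptions.

```python
import random

ARRIVAL_RATE = 10.0   # cars per hour (Poisson mean)
HOURS = 10_000        # number of simulated hours (arbitrary choice)

rng = random.Random(42)  # fixed seed so the run is repeatable

counts = [0] * HOURS     # arrivals observed in each one-hour interval
gaps = []                # simulated times between arrivals, in hours

# Generate arrival times by adding up exponential gaps with mean 1/10 hour.
clock = rng.expovariate(ARRIVAL_RATE)
while clock < HOURS:
    counts[int(clock)] += 1
    gap = rng.expovariate(ARRIVAL_RATE)
    gaps.append(gap)
    clock += gap

mean_count = sum(counts) / HOURS   # should be close to 10 cars per hour
mean_gap = sum(gaps) / len(gaps)   # should be close to .1 hour per car

print(f"mean arrivals per hour: {mean_count:.2f}")
print(f"mean time between arrivals: {mean_gap:.4f} hours")
```

With this many simulated hours, the average number of arrivals per hour should come out near 10 and the average time between arrivals near .1 hour, which is the relationship described above.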


Using Excel to Compute Exponential Probabilities 


Excel’s EXPON.DIST function can be used to compute exponential probabilities. We will 
illustrate by computing probabilities associated with the time it takes to load a truck at the 
Schips loading dock. This example was introduced at the beginning of the section. Refer to 
Figure 6.18 as we describe the tasks involved. The formula worksheet is in the background; 
the value worksheet is in the foreground. 


Enter/Access Data: Open a blank worksheet. No data are entered in the worksheet. We 
simply enter the appropriate values for the exponential random variable into the formulas 
as needed. The random variable is x = loading time. 


Enter Functions and Formulas: The EXPON.DIST function has three inputs: The first 
is the value of x, the second is 1/μ, and the third is TRUE or FALSE. We choose TRUE for 
the third input if a cumulative probability is desired and FALSE if the height of the proba- 
bility density function is desired. We will always use TRUE because we will be computing 
cumulative probabilities. 


The first probability we compute is the probability that the loading time is 18 minutes 
or less. For the Schips problem, 1/μ = 1/15, so we enter the formula 
=EXPON.DIST(18,1/15,TRUE) in cell D3 to compute the desired cumulative probability. 
From the value worksheet, we see that the probability of loading a truck in 
18 minutes or less is .6988. 

FIGURE 6.18  Excel Worksheet for Computing Probabilities for the 
Exponential Probability Distribution 

Probabilities: Exponential Distribution 

Cell   Description          Formula                                                  Value
D3     P(x <= 18)           =EXPON.DIST(18,1/15,TRUE)                                0.6988
D4     P(6 <= x <= 18)      =EXPON.DIST(18,1/15,TRUE)-EXPON.DIST(6,1/15,TRUE)        0.3691
D5     P(x >= 8)            =1-EXPON.DIST(8,1/15,TRUE)                               0.5866

The second probability we compute is the probability that the loading time is between 6 
and 18 minutes. To find this probability we first compute the cumulative probability for the 
upper endpoint of the time interval and subtract the cumulative probability for the lower 
endpoint of the interval. The formula we have entered in cell D4 calculates this probability. 
The value worksheet shows that this probability is .3691. 

The last probability we calculate is the probability that the loading time is at 
least 8 minutes. Because the EXPON.DIST function computes only cumulative 
(lower tail) probabilities, we compute this probability by entering the formula 
=1-EXPON.DIST(8,1/15,TRUE) in cell D5. The value worksheet shows that the 
probability of a loading time of 8 minutes or more is .5866. 
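
To make the three inputs of EXPON.DIST concrete, here is a minimal Python sketch of a function with the same argument order. The name expon_dist is our own, and the ValueError raised for invalid inputs is our choice for this sketch, not a description of how Excel reports errors. Note that the second input is the rate 1/μ, not the mean.

```python
import math

def expon_dist(x, lam, cumulative):
    """Sketch of EXPON.DIST(x, lambda, cumulative), where lam = 1/mean.

    cumulative=True returns P(X <= x); cumulative=False returns the
    height of the probability density function at x.
    """
    if x < 0 or lam <= 0:
        raise ValueError("x must be >= 0 and lam must be > 0")
    if cumulative:
        return 1.0 - math.exp(-lam * x)
    return lam * math.exp(-lam * x)

RATE = 1 / 15  # 1/mean for the Schips loading dock example

d3 = expon_dist(18, RATE, True)                               # P(x <= 18)
d4 = expon_dist(18, RATE, True) - expon_dist(6, RATE, True)   # P(6 <= x <= 18)
d5 = 1 - expon_dist(8, RATE, True)                            # P(x >= 8)

print(f"{d3:.4f} {d4:.4f} {d5:.4f}")
```

Rounded to four decimal places, the results should match cells D3:D5 of Figure 6.18.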


NOTES + COMMENTS 


As we can see in Figure 6.17, the exponential distribution is skewed to the right. Indeed, 
the skewness measure for the exponential distribution is 2. The exponential distribution 
gives us a good idea what a skewed distribution looks like. 


EXERCISES 


Methods 

26. Consider the following exponential probability density function. 

$$f(x) = \frac{1}{8} e^{-x/8} \quad \text{for } x \ge 0$$

a. Find P(x ≤ 6). 
b. Find P(x ≤ 4). 
c. Find P(x ≥ 6). 
d. Find P(4 ≤ x ≤ 6). 

27. Consider the following exponential probability density function. 

$$f(x) = \frac{1}{3} e^{-x/3} \quad \text{for } x \ge 0$$

a. Write the formula for P(x ≤ x₀). 
b. Find P(x ≤ 2). 
c. Find P(x ≥ 3). 
d. Find P(x ≤ 5). 
e. Find P(2 ≤ x ≤ 5). 

Applications 

28. Phone Battery Life. Battery life between charges for a certain mobile phone is 
20 hours when the primary use is talk time, and drops to 7 hours when the phone is 
primarily used for Internet applications over cellular. Assume that the battery life in 
both cases follows an exponential distribution. 

a. Show the probability density function for battery life for this phone when its pri- 
mary use is talk time. 

b. What is the probability that the battery charge for a randomly selected phone will 
last no more than 15 hours when its primary use is talk time? 

c. What is the probability that the battery charge for a randomly selected phone will 
last more than 20 hours when its primary use is talk time? 

d. What is the probability that the battery charge for a randomly selected phone 
will last no more than 5 hours when its primary use is Internet applications? 

29. Arrival of Vehicles to an Intersection. The time between arrivals of vehicles at a 
particular intersection follows an exponential probability distribution with a mean of 
12 seconds. 

a. Sketch this exponential probability distribution. 

b. What is the probability that the arrival time between vehicles is 12 seconds or less? 
c. What is the probability that the arrival time between vehicles is 6 seconds or less? 
d. What is the probability of 30 or more seconds between vehicle arrivals? 

30. Comcast Service Operations. Comcast Corporation is a global telecommunications 
company headquartered in Philadelphia, PA. Generally known for reliable service, the 
company periodically experiences unexpected service interruptions. When service inter- 
ruptions do occur, Comcast customers who call the office receive a message providing 
an estimate of when service will be restored. Suppose that for a particular outage, Com- 
cast customers are told that service will be restored in two hours. Assume that two hours 
is the mean time to do the repair and that the repair time has an exponential probability 
distribution. 

a. What is the probability that the cable service will be repaired in one hour or less? 

b. What is the probability that the repair will take between one hour and two hours? 

c. For a customer who calls the Comcast office at 1:00 P.M., what is the probability 
that the cable service will not be repaired by 5:00 P.M.? 

31. Patient Length of Stays in ICUs. Intensive care units (ICUs) generally treat the sickest 
patients in a hospital. ICUs are often the most expensive department in a hospital because 
of the specialized equipment and extensive training required to be an ICU doctor or nurse. 
Therefore, it is important to use ICUs as efficiently as possible in a hospital. According 
to a 2017 large-scale study of elderly ICU patients, the average length of stay in the ICU 
is 3.4 days (Critical Care Medicine journal article). Assume that this length of stay in the 
ICU has an exponential distribution. 

a. What is the probability that the length of stay in the ICU is one day or less? 

b. What is the probability that the length of stay in the ICU is between two and three 
days? 

c. What is the probability that the length of stay in the ICU is more than five days? 

32. Boston 911 Calls. The Boston Fire Department receives 911 calls at a mean rate of 
1.6 calls per hour (Mass.gov website). Suppose the number of calls per hour follows a 
Poisson probability distribution. 

a. What is the mean time between 911 calls to the Boston Fire Department in 
minutes? 

b. Using the mean in part (a), show the probability density function for the time 
between 911 calls in minutes. 

c. What is the probability that there will be less than one hour between 911 calls? 

d. What is the probability that there will be 30 minutes or more between 911 calls? 

e. What is the probability that there will be more than 5 minutes but less than 20 
minutes between 911 calls? 

SUMMARY 


This chapter extended the discussion of probability distributions to the case of continuous 
random variables. The major conceptual difference between discrete and continuous proba- 
bility distributions involves the method of computing probabilities. With discrete distri- 
butions, the probability function f(x) provides the probability that the random variable x 
assumes various values. With continuous distributions, the probability density function f(x) 
does not provide probability values directly. Instead, probabilities are given by areas under 
the curve or graph of the probability density function f(x). Because the area under the curve 
above a single point is zero, we observe that the probability of any particular value is zero 
for a continuous random variable. 

Three continuous probability distributions—the uniform, normal, and exponential 
distributions—were treated in detail. The normal distribution is used widely in statistical 
inference and will be used extensively throughout the remainder of the text. 


GLOSSARY 
Exponential probability distribution A continuous probability distribution that is useful 
in computing probabilities for the time it takes to complete a task. 

Normal probability distribution A continuous probability distribution. Its probability 
density function is bell shaped and determined by its mean μ and standard deviation σ. 
Probability density function A function used to compute probabilities for a continuous 
random variable. The area under the graph of a probability density function over an inter- 
val represents probability. 

Standard normal probability distribution A normal distribution with a mean of zero and 
a standard deviation of one. 

Uniform probability distribution A continuous probability distribution for which the 
probability that the random variable will assume a value in any interval is the same for 
each interval of equal length. 


KEY FORMULAS 


Uniform Probability Density Function 

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases} \tag{6.1}$$

Normal Probability Density Function 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \tag{6.2}$$

Converting to the Standard Normal Random Variable 

$$z = \frac{x - \mu}{\sigma} \tag{6.3}$$

Exponential Probability Density Function 

$$f(x) = \frac{1}{\mu} e^{-x/\mu} \quad \text{for } x \ge 0 \tag{6.4}$$

Exponential Distribution: Cumulative Probabilities 

$$P(x \le x_0) = 1 - e^{-x_0/\mu} \tag{6.5}$$

33. Selling a House. A business executive, transferred from Chicago to Atlanta, needs 
to sell her house in Chicago quickly. The executive’s employer has offered to buy 
the house for $210,000, but the offer expires at the end of the week. The executive 
does not currently have a better offer but can afford to leave the house on the mar- 
ket for another month. From conversations with her realtor, the executive believes 
the price she will get by leaving the house on the market for another month is 
uniformly distributed between $200,000 and $225,000. 

a. If she leaves the house on the market for another month, what is the mathematical 
expression for the probability density function of the sales price? 

b. If she leaves it on the market for another month, what is the probability that she will 
get at least $215,000 for the house? 

c. If she leaves it on the market for another month, what is the probability that she will 
get less than $210,000? 

d. Should the executive leave the house on the market for another month? Why or why not? 

34. NCAA Scholarships. The NCAA estimates that the yearly value of a full athletic 
scholarship at in-state public universities is $19,000 (The Wall Street Journal). Assume 
the scholarship value is normally distributed with a standard deviation of $2,100. 

a. For the 10% of athletic scholarships of least value, how much are they worth? 

b. What percentage of athletic scholarships are valued at $22,000 or more? 

c. For the 3% of athletic scholarships that are most valuable, how much are they worth? 

35. Production Defects. Motorola used the normal distribution to determine the probab- 
ility of defects and the number of defects expected in a production process. Assume 
a production process produces items with a mean weight of 10 ounces. Calculate the 
probability of a defect and the expected number of defects for a 1,000-unit production 
run in the following situations. 

a. The process standard deviation is .15, and the process control is set at plus or minus 
one standard deviation. Units with weights less than 9.85 or greater than 10.15 
ounces will be classified as defects. 

b. Through process design improvements, the process standard deviation can be re- 
duced to .05. Assume the process control remains the same, with weights less than 
9.85 or greater than 10.15 ounces being classified as defects. 

c. What is the advantage of reducing process variation, thereby causing process con- 
trol limits to be at a greater number of standard deviations from the mean? 

36. Bringing Items to a Pawnshop. One indicator of the level of economic hardship in 
a community is the number of people who bring items to a pawnbroker. Assume the 
number of people bringing items to the pawnshop per day is normally distributed with 
a mean of 658. 

a. Suppose you learn that on 3% of the days, 610 or fewer people brought items to the 
pawnshop. What is the standard deviation of the number of people bringing items to 
the pawnshop per day? 

b. On any given day, what is the probability that between 600 and 700 people bring 
items to the pawnshop? 

c. How many people bring items to the pawnshop on the busiest 3% of days? 

37. Amazon Alexa App Downloads. Alexa is the popular virtual assistant developed by 
Amazon. Alexa interacts with users using artificial intelligence and voice recognition. 
It can be used to perform daily tasks such as making to-do lists, reporting the news and 
weather, and interacting with other smart devices in the home. In 2018, the Amazon 
Alexa app was downloaded some 2,800 times per day from the Google Play store 
(AppBrain website). Assume that the number of downloads per day of the Amazon 
Alexa app is normally distributed with a mean of 2,800 and standard deviation of 860. 
a. What is the probability there are 2,000 or fewer downloads of Amazon Alexa in 
a day? 

b. What is the probability there are between 1,500 and 2,500 downloads of Amazon 
Alexa in a day? 

c. What is the probability there are more than 3,000 downloads of Amazon Alexa in 
a day? 

d. Assume that Google has designed its servers so there is probability .01 that the 
number of Amazon Alexa app downloads in a day exceeds the servers’ capacity and 
more servers have to be brought online. How many Amazon Alexa app downloads 
per day are Google’s servers designed to handle? 

38. Service Contract Offer. Ward Doering Auto Sales is considering offering a special 
service contract that will cover the total cost of any service work required on leased 
vehicles. From experience, the company manager estimates that yearly service costs 
are approximately normally distributed, with a mean of $150 and a standard deviation 
of $25. 

a. If the company offers the service contract to customers for a yearly charge of $200, 
what is the probability that any one customer’s service costs will exceed the con- 
tract price of $200? 

b. What is Ward’s expected profit per service contract? 

39. Wedding Costs. The XO Group Inc. conducted a 2015 survey of 13,000 brides and 
grooms married in the United States and found that the average cost of a wedding is 
$29,858 (XO Group website). Assume that the cost of a wedding is normally distributed 
with a mean of $29,858 and a standard deviation of $5,600. 

a. What is the probability that a wedding costs less than $20,000? 

b. What is the probability that a wedding costs between $20,000 and $30,000? 

c. For a wedding to be among the 5% most expensive, how much would it have 
to cost? 

40. College Admissions Test Scores. Assume that the test scores from a college admis- 
sions test are normally distributed, with a mean of 450 and a standard deviation of 100. 
a. What percentage of the people taking the test score between 400 and 500? 

b. Suppose someone receives a score of 630. What percentage of the people taking the 
test score better? What percentage score worse? 

c. If a particular university will not admit anyone scoring below 480, what percentage 
of the persons taking the test would be acceptable to the university? 

41. College Graduates Starting Salaries. According to the National Association of 
Colleges and Employers, the 2015 average starting salary for new college graduates 
in health sciences was $51,541. The average starting salary for new college graduates 
in business was $53,901 (National Association of Colleges and Employers website). 
Assume that starting salaries are normally distributed and that the standard deviation 
for starting salaries for new college graduates in health sciences is $11,000. Assume 
that the standard deviation for starting salaries for new college graduates in business is 
$15,000. 

a. What is the probability that a new college graduate in business will earn a starting 
salary of at least $65,000? 

b. What is the probability that a new college graduate in health sciences will earn a 
starting salary of at least $65,000? 

c. What is the probability that a new college graduate in health sciences will earn a 
starting salary less than $40,000? 

d. How much would a new college graduate in business have to earn in order to have 
a starting salary higher than 99% of all starting salaries of new college graduates in 
the health sciences? 

42. Filling Weights. A machine fills containers with a particular product. The stand- 
ard deviation of filling weights is known from past data to be .6 ounce. If only 2% 
of the containers hold less than 18 ounces, what is the mean filling weight for the 
machine? That is, what must μ equal? Assume the filling weights have a normal 
distribution. 
43. Mean Time Between Failures. The mean time between failures (MTBF) is a common 
metric used to measure the performance of manufacturing systems. MTBF is the 
elapsed time between failures of a system during normal operations. The failures could 
be caused by broken machines or computer errors, among other failures. Suppose that 
the MTBF for a new automated manufacturing system follows an exponential distribu- 
tion with a mean of 12.7 hours. 

a. What is the probability that the automated manufacturing system runs for more than 
15 hours without a failure? 

b. What is the probability that the automated manufacturing system runs for eight or 
fewer hours before failure? 

c. What is the probability that the automated manufacturing system runs for more than 
six hours but less than 10 hours before a failure? 

44. Bed and Breakfast Inns Website. A website for bed and breakfast inns gets approx- 
imately seven visitors per minute. Suppose the number of website visitors per minute 
follows a Poisson probability distribution. 

a. What is the mean time between visits to the website? 

b. Show the exponential probability density function for the time between website 
visits. 

c. What is the probability that no one will access the website in a 1-minute period? 

d. What is the probability that no one will access the website in a 12-second period? 

45. Waiting in Line at Kroger. Do you dislike waiting in line? Supermarket chain Kroger has 
used computer simulation and information technology to reduce the average waiting time 
for customers at 2,300 stores. Using a new system called Que Vision, which allows Kroger 
to better predict when shoppers will be checking out, the company was able to decrease 
average customer waiting time to just 26 seconds (InformationWeek website). Assume 
that waiting times at Kroger are exponentially distributed. 

a. Show the probability density function of waiting time at Kroger. 
b. What is the probability that a customer will have to wait between 15 and 
30 seconds? 
c. What is the probability that a customer will have to wait more than 2 minutes? 

46. Calls to Insurance Claims Office. The time (in minutes) between telephone calls at 
an insurance claims office has the following exponential probability distribution. 

f(x) = .50e^{-.50x}  for x ≥ 0 

a. What is the mean time between telephone calls? 

b. What is the probability of having 30 seconds or less between telephone calls? 

c. What is the probability of having 1 minute or less between telephone calls? 

d. What is the probability of having 5 or more minutes without a telephone call? 


CASE PROBLEM 1: SPECIALTY TOYS 


Specialty Toys, Inc. sells a variety of new and innovative children’s toys. Management 
learned that the preholiday season is the best time to introduce a new toy, because many 
families use this time to look for new ideas for December holiday gifts. When Specialty 
discovers a new toy with good market potential, it chooses an October market entry date. 
In order to get toys into its stores by October, Specialty places one-time orders with its 
manufacturers in June or July of each year. Demand for children’s toys can be highly volatile. 
If a new toy catches on, a sense of shortage in the marketplace often increases the demand 
to high levels and large profits can be realized. However, new toys can also flop, leaving 
Specialty stuck with high levels of inventory that must be sold at reduced prices. The most 
important question the company faces is deciding how many units of a new toy should be 
purchased to meet anticipated sales demand. If too few are purchased, sales will be lost; if too 
many are purchased, profits will be reduced because of low prices realized in clearance sales. 
For the coming season, Specialty plans to introduce a new product called Weather 
Teddy. This variation of a talking teddy bear is made by a company in Taiwan. When a 
child presses Teddy’s hand, the bear begins to talk. A built-in barometer selects one of five 
responses that predict the weather conditions. The responses range from “It looks to be a 
very nice day! Have fun” to “I think it may rain today. Don’t forget your umbrella.” Tests 
with the product show that, even though it is not a perfect weather predictor, its predictions 
are surprisingly good. Several of Specialty’s managers claimed Teddy gave predictions of 
the weather that were as good as those of many local television weather forecasters. 

As with other products, Specialty faces the decision of how many Weather Teddy units 
to order for the coming holiday season. Members of the management team suggested order 
quantities of 15,000, 18,000, 24,000, or 28,000 units. The wide range of order quantities 
suggested indicates considerable disagreement concerning the market potential. The prod- 
uct management team asks you for an analysis of the stock-out probabilities for various 
order quantities, an estimate of the profit potential, and help with making an order quantity 
recommendation. Specialty expects to sell Weather Teddy for $24 based on a cost of $16 
per unit. If inventory remains after the holiday season, Specialty will sell all surplus inven- 
tory for $5 per unit. After reviewing the sales history of similar products, Specialty’s senior 
sales forecaster predicted an expected demand of 20,000 units with a .95 probability that 
demand would be between 10,000 units and 30,000 units. 


Managerial Report 
Prepare a managerial report that addresses the following issues and recommends an order 
quantity for the Weather Teddy product. 


1. Use the sales forecaster’s prediction to describe a normal probability distribution that 
can be used to approximate the demand distribution. Sketch the distribution and show 
its mean and standard deviation. 

2. Compute the probability of a stock-out for the order quantities suggested by members 
of the management team. 

3. Compute the projected profit for the order quantities suggested by the management 
team under three scenarios: worst case in which sales = 10,000 units, most likely case 
in which sales = 20,000 units, and best case in which sales = 30,000 units. 

4. One of Specialty’s managers felt that the profit potential was so great that the order 
quantity should have a 70% chance of meeting demand and only a 30% chance of any 
stock-outs. What quantity would be ordered under this policy, and what is the projected 
profit under the three sales scenarios? 

5. Provide your own recommendation for an order quantity and note the associated profit 
projections. Provide a rationale for your recommendation. 


CASE PROBLEM 2: GEBHARDT ELECTRONICS 
Gebhardt Electronics produces a wide variety of transformers that it sells directly to 
manufacturers of electronics equipment. For one component used in several models of its 
transformers, Gebhardt uses a 3-foot length of .20-mm diameter solid wire made of pure 
Oxygen-Free Electronic (OFE) copper. A flaw in the wire reduces its conductivity and 
increases the likelihood it will break, and this critical component is difficult to reach and 
repair after a transformer has been constructed. Therefore, Gebhardt wants to use primarily 
flawless lengths of wire in making this component. The company is willing to accept 
no more than a 1 in 20 chance that a 3-foot length taken from a spool will be flawless. 
Gebhardt also occasionally uses smaller pieces of the same wire in the manufacture of 
other components, so the 3-foot segments to be used for this component are essentially 
taken randomly from a long spool of .20-mm diameter solid OFE copper wire. 

Gebhardt is now considering a new supplier for copper wire. This supplier claims that its 
spools of .20-mm diameter solid OFE copper wire average 50 inches between flaws. Gebhardt 
now must determine whether the new supply will be satisfactory if the supplier’s claim is valid. 
Managerial Report 
In making this assessment for Gebhardt Electronics, consider the following four questions: 


1. If the new supplier does provide spools of .20-mm solid OFE copper wire that average 
50 inches between flaws, how is the length of wire between two consecutive flaws 
distributed? 

2. Using the probability distribution you identified in (1), what is the probability that 
Gebhardt’s criteria will be met (i.e., a 1 in 20 chance that a randomly selected 3-foot 
segment of wire provided by the new supplier will be flawless)? 

3. In inches, what is the minimum mean length between consecutive flaws that would 
result in satisfaction of Gebhardt’s criteria? 

4. In inches, what is the minimum mean length between consecutive flaws that would 
result in a 1 in 100 chance that a randomly selected 3-foot segment of wire provided by 
the new supplier will be flawless? 
STATISTICS IN PRACTICE 


The Food and Agriculture Organization 
ROME, ITALY 


The Food and Agriculture Organization (FAO) is a spe- 
cialized agency of the United Nations that leads inter- 
national efforts to achieve food security and ensure that 
people have regular access to sufficient high-quality 
food to lead active, healthy lives. The FAO includes 
over 190 member states and the organization is active 
in over 130 countries worldwide. 

As part of its mission, the FAO engages in national 
forest assessments that require it to gather information 
that is useful in efforts to fight deforestation and the 
resulting land degradation (which reduces a nation’s 
ability to grow food crops). These national forest 
assessments require reliable and accurate informa- 
tion about a nation’s timberlands and forests. What 
is the present volume in the forests? What is the past 
growth of the forests? What is the projected future 
growth of the forests? With answers to these impor- 
tant questions, the FAO can assess the conditions of a 
nation’s forest inventory, assist the nation in developing 
strategies and plans for the future (including long-term 
planting and harvesting schedules for the trees), and 
slow or eliminate deforestation and land degradation. 

The FAO uses a cluster sampling methodology to 
obtain the information it needs about a nation’s vast 
forest holdings. Once the forest population is defined, 
it is divided into plots of a prespecified size and shape 
and the variables to be measured are identified. A 
random sample of these plots is selected, and mea- 
surements are taken from each selected plot. 

The use of cluster sampling reduces the travel required 
for data collection while still providing a sample 
that enables the FAO to obtain national estimates of 
the total area of forest categorized into various forest 
types and conditions. The FAO also collects the wood 
volume and distribution of trees by species and size, 
estimates of changes in forest attributes, and indicators 
of biodiversity. The sampling plan further allows the 
FAO to obtain sufficiently precise estimates for selected 
geographic regions, collect sufficient information to 
satisfy international reporting requirements, and achieve 
an acceptable balance between cost and precision. 

Random sampling of its forest holdings enables 
the FAO to assist nations in the management of their forest 
inventories. © Robert Crum/Shutterstock.com 

In this chapter, you will learn about various random 
sampling methods (including cluster sampling) and the 
sample selection process. In addition, you will learn how 
statistics such as the sample mean and sample proportion 
are used to estimate the population mean and population 
proportion. The important concept of a sampling 
distribution is also introduced. 


McRoberts, R.E., Tomppo, E.O., Czaplewski, R.L. “Sampling Designs for National Forest Assessments”, Food and Agriculture Organization of the 
United Nations, http://www.fao.org/forestry/44859-02cf95ef26dfdcb86c6be27 20f8b938a8.pdf. 


In Chapter 1 we presented the following definitions of an element, a population, and a sample. 


• An element is the entity on which data are collected. 
• A population is the collection of all the elements of interest. 
• A sample is a subset of the population. 
A sample mean provides an estimate of a population mean, and a sample proportion 
provides an estimate of a population proportion. With estimates such as these, some 
estimation error can be expected. This chapter provides the basis for determining how 
large that error might be. 




The reason we select a sample is to collect data to make an inference and answer research 
questions about a population. 

Let us begin by citing two examples in which sampling was used to answer a research 
question about a population. 


1. Members of a political party in Texas were considering supporting a particular 
candidate for election to the U.S. Senate, and party leaders wanted to estimate the 
proportion of registered voters in the state favoring the candidate. A sample of 
400 registered voters in Texas was selected and 160 of the 400 voters indicated a 
preference for the candidate. Thus, an estimate of the proportion of the population 
of registered voters favoring the candidate is 160/400 = .40. 

2. A tire manufacturer is considering producing a new tire designed to provide an 
increase in mileage over the firm’s current line of tires. To estimate the mean useful 
life of the new tires, the manufacturer produced a sample of 120 tires for testing. 
The test results provided a sample mean of 36,500 miles. Hence, an estimate of the 
mean useful life for the population of new tires was 36,500 miles. 


It is important to realize that sample results provide only estimates of the values of the 
corresponding population characteristics. We do not expect exactly .40, or 40%, of the 
population of registered voters to favor the candidate, nor do we expect the sample mean of 
36,500 miles to exactly equal the mean mileage for the population of all new tires produced. 
The reason is simply that the sample contains only a portion of the population. Some sam- 
pling error is to be expected. With proper sampling methods, the sample results will provide 
“good” estimates of the population parameters. But how good can we expect the sample 
results to be? Fortunately, statistical procedures are available for answering this question. 
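
The idea that sample results are only estimates can be illustrated with a short simulation. The Python sketch below repeats the voter survey several times, drawing 400 voters each time, and prints the resulting sample proportions. The population size of 1,000,000 and the true proportion of .42 are hypothetical values assumed only for this illustration; neither appears in the example above.

```python
import random

def sample_proportion(pop_size, num_favoring, n, rng):
    """Draw n voters without replacement and return the sample proportion.

    Voters are labeled 0..pop_size-1; labels below num_favoring are treated
    as favoring the candidate (a hypothetical encoding for this sketch).
    """
    sample = rng.sample(range(pop_size), n)
    favoring = sum(1 for voter in sample if voter < num_favoring)
    return favoring / n

# Hypothetical population values, assumed only for illustration.
POP_SIZE = 1_000_000
NUM_FAVORING = 420_000
TRUE_P = NUM_FAVORING / POP_SIZE  # .42

rng = random.Random(7)  # fixed seed so the run is repeatable
estimates = [sample_proportion(POP_SIZE, NUM_FAVORING, 400, rng) for _ in range(5)]
for p_bar in estimates:
    print(f"sample proportion = {p_bar:.3f}, sampling error = {p_bar - TRUE_P:+.3f}")
```

Each run of the survey gives a slightly different estimate, and none is guaranteed to equal the population proportion exactly. The difference between an estimate and the population value is the sampling error.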

Let us define some of the terms used in sampling. The sampled population is the 
population from which the sample is drawn, and a frame is a list of the elements from 
which the sample will be selected. In the first example, the sampled population is all reg- 
istered voters in Texas, and the frame is a list of all the registered voters in Texas. Because 
the number of registered voters in Texas is a finite number, the first example is an illustra- 
tion of sampling from a finite population. In Section 7.2, we discuss how a simple random 
sample can be selected when sampling from a finite population. 

The sampled population for the tire mileage example is more difficult to define because 
the sample of 120 tires was obtained from a production process at a particular point in 
time. We can think of the sampled population as the conceptual population of all the tires 
that could have been made by the production process at that particular point in time. In this 
sense the sampled population is considered infinite, making it impossible to construct a 
frame from which to draw the sample. In Section 7.2, we discuss how to select a random 
sample in such a situation. 

In this chapter, we show how simple random sampling can be used to select a sample 
from a finite population and describe how a random sample can be taken from an infinite 
population that is generated by an ongoing process. We then show how data obtained from 
a sample can be used to compute estimates of a population mean, a population standard 
deviation, and a population proportion. In addition, we introduce the important concept of a 
sampling distribution. As we will show, knowledge of the appropriate sampling distribution 
enables us to make statements about how close the sample estimates are to the correspond- 
ing population parameters. In Section 7.7, we discuss some alternatives to simple random 
sampling that are often employed in practice. In the last section, we discuss sampling and 
nonsampling error, and how these relate to large samples. 


7.1 The Electronics Associates Sampling Problem 


The director of personnel for Electronics Associates, Inc. (EAI) has been assigned the task 
of developing a profile of the company’s 2500 employees. The characteristics to be iden- 
tified include the mean annual salary for the employees and the proportion of employees 
having completed the company’s management training program. 
Using the 2500 employees as the population for this study, we can find the annual salary 
and the training program status for each individual by referring to the firm’s personnel 
records. The data set containing this information for all 2500 employees in the population 
is in the file EAI. 

Using the EAI data and the formulas presented in Chapter 3, we compute the population 
mean and the population standard deviation for the annual salary data. 


Population mean: μ = $71,800 

Population standard deviation: σ = $4000 


The data for the training program status show that 1500 of the 2500 employees completed 
the training program. 
Numerical characteristics of a population are called parameters. Letting p denote 
the proportion of the population that completed the training program, we see that 
p = 1500/2500 = .60. The population mean annual salary (μ = $71,800), the population 
standard deviation of annual salary (σ = $4000), and the population proportion 
that completed the training program (p = .60) are parameters of the population of EAI 
employees. 
Now, suppose that the necessary information on all the EAI employees was not readily 
available in the company’s database. The question we now consider is how the firm’s 
director of personnel can obtain estimates of the population parameters by using a 
sample of employees rather than all 2500 employees in the population. Suppose that 
a sample of 30 employees will be used. Clearly, the time and the cost of developing a 
profile would be substantially less for 30 employees than for the entire population. If 
the personnel director could be assured that a sample of 30 employees would provide 
adequate information about the population of 2500 employees, working with a sample 
would be preferable to working with the entire population. Let us explore the possibility 
of using a sample for the EAI study by first considering how we can identify a sample 
of 30 employees. 

Often the cost of collecting information from a sample is substantially less than 
from a population, especially when personal interviews must be conducted to collect 
the information. 


7.2 Selecting a Sample 


In this section we describe how to select a sample. We first describe how to sample from a 
finite population and then describe how to select a sample from an infinite population. 


Sampling from a Finite Population 


Statisticians recommend selecting a probability sample when sampling from a finite 
population because a probability sample allows them to make valid statistical inferences 
about the population. The simplest type of probability sample is one in which each 
sample of size n has the same probability of being selected. It is called a simple random 
sample. A simple random sample of size n from a finite population of size N is defined 
as follows. 

Other methods of probability sampling are described in Section 7.7. 


SIMPLE RANDOM SAMPLE (FINITE POPULATION) 


A simple random sample of size n from a finite population of size N is a sample 
selected such that each possible sample of size n has the same probability of being 
selected. 


The procedures used to select a simple random sample from a finite population are 
based upon the use of random numbers. We can use Excel’s RAND function to generate 
a random number between 0 and 1 by entering the formula =RAND() into any cell in a 
worksheet. The number generated is called a random number because the mathematical 
procedure used by the RAND function guarantees that every number between 0 and 1 has 
the same probability of being selected. Let us see how these random numbers can be used 
to select a simple random sample. 

The random numbers generated using Excel’s RAND function follow a uniform probability 
distribution between 0 and 1. 

Our procedure for selecting a simple random sample of size n from a population of 
size N involves two steps. 


Step 1. Assign a random number to each element of the population. 
Step 2. Select the n elements corresponding to the n smallest random numbers. 


Because each set of n elements in the population has the same probability of being as- 
signed the n smallest random numbers, each set of n elements has the same probability of 
being selected for the sample. If we select the sample using this two-step procedure, every 
sample of size n has the same probability of being selected; thus, the sample selected satis- 
fies the definition of a simple random sample. 
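
For readers who want to see the two-step procedure outside of Excel, the following Python sketch mirrors it directly. The function name `simple_random_sample`, the seed value, and the use of the integers 1 through 2500 as stand-in labels for the elements of a population are illustrative assumptions, not part of the text.

```python
import random

def simple_random_sample(population, n, rng=None):
    """Select a simple random sample of size n using the two-step procedure."""
    rng = rng or random.Random()
    # Step 1: assign a random number between 0 and 1 to each element.
    tagged = [(rng.random(), element) for element in population]
    # Step 2: keep the n elements with the n smallest random numbers.
    tagged.sort(key=lambda pair: pair[0])
    return [element for _, element in tagged[:n]]

# Hypothetical labels 1..2500 standing in for the elements of a population.
population = list(range(1, 2501))
sample = simple_random_sample(population, 30, random.Random(2024))
print(sample)
```

Python's standard library also offers `random.sample(population, n)`, which draws a sample without replacement in a single call. The version above is written out step by step only to match the procedure described here.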

Let us consider an example involving selecting a simple random sample of size 
n = 5 from a population of size N = 15. Table 7.1 contains a list of the 15 teams 
in the 2019 National Baseball League. Suppose we want to select a simple random 
sample of 5 teams to conduct in-depth interviews about how they manage their 
minor league franchises. 

Step 1 of our simple random sampling procedure requires that we assign a random 
number to each of the 15 teams in the population. Figure 7.1 shows a worksheet used 
to generate a random number corresponding to each of the 15 teams in the population. 
The names of the baseball teams are in column A, and the random numbers generated 
are in column B. From the formula worksheet in the background we see that the formula 
=RAND() has been entered in cells B2:B16 to generate the random numbers between 0 
and 1. From the value worksheet in the foreground we see that Arizona is assigned the ran- 
dom number .850862, Atlanta has been assigned the random number .706245, and so on. 

The second step is to select the five teams corresponding to the five smallest random 
numbers as our sample. Looking through the random numbers in Figure 7.1, we see that 
the team corresponding to the smallest random number (.066942) is Washington, and 
that the four teams corresponding to the next four smallest random numbers are Miami, 
San Francisco, St. Louis, and New York. Thus, these five teams make up the simple 
random sample. 
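
The selection can be checked with a few lines of Python. This sketch takes the 15 random numbers shown in the value worksheet of Figure 7.1 as given, sorts the teams by those numbers, and keeps the first five. The variable names are illustrative.

```python
# Random numbers assigned to each team, as shown in Figure 7.1.
random_numbers = {
    "Arizona": 0.850862,
    "Atlanta": 0.706245,
    "Chicago": 0.724789,
    "Cincinnati": 0.614784,
    "Colorado": 0.553815,
    "Los Angeles": 0.857324,
    "Miami": 0.179123,
    "Milwaukee": 0.525636,
    "New York": 0.471490,
    "Philadelphia": 0.523103,
    "Pittsburgh": 0.851552,
    "San Diego": 0.806185,
    "San Francisco": 0.327713,
    "St. Louis": 0.374168,
    "Washington": 0.066942,
}

# Order the teams from smallest to largest random number.
ranked = sorted(random_numbers.items(), key=lambda item: item[1])

# The five teams with the five smallest random numbers form the sample.
sample = [team for team, _ in ranked[:5]]
print(sample)
# ['Washington', 'Miami', 'San Francisco', 'St. Louis', 'New York']
```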

Searching through the list of random numbers in Figure 7.1 to find the five smallest ran- 
dom numbers is tedious, and it is easy to make mistakes. Excel’s Sort procedure simplifies 
this step. We illustrate by sorting the list of baseball teams in Figure 7.1 to find the five 
teams corresponding to the five smallest random numbers. Refer to the foreground work- 
sheet in Figure 7.1 as we describe the steps involved. 


Step 1. Select any cell in the range B2:B16 

Step 2. Click the Home tab on the Ribbon 

Step 3. In the Editing group, click Sort & Filter 
Step 4. Choose Sort Smallest to Largest 


TABLE 7.1 2019 National Baseball League Teams 


Arizona Los Angeles Pittsburgh 
Atlanta Miami San Diego 
Chicago Milwaukee San Francisco 
Cincinnati New York St. Louis 
Colorado Philadelphia Washington 
FIGURE 7.1 Worksheet Used to Generate a Random Number 
Corresponding to Each Team 

DATA file: NationalLeague 

Row   A (Team)         B, formula worksheet   B, value worksheet 
1     Team             Random Numbers         Random Numbers 
2     Arizona          =RAND()                0.850862 
3     Atlanta          =RAND()                0.706245 
4     Chicago          =RAND()                0.724789 
5     Cincinnati       =RAND()                0.614784 
6     Colorado         =RAND()                0.553815 
7     Los Angeles      =RAND()                0.857324 
8     Miami            =RAND()                0.179123 
9     Milwaukee        =RAND()                0.525636 
10    New York         =RAND()                0.471490 
11    Philadelphia     =RAND()                0.523103 
12    Pittsburgh       =RAND()                0.851552 
13    San Diego        =RAND()                0.806185 
14    San Francisco    =RAND()                0.327713 
15    St. Louis        =RAND()                0.374168 
16    Washington       =RAND()                0.066942 


After completing these steps we obtain the worksheet shown in Figure 7.2.¹ The teams 
listed in rows 2-6 are the ones corresponding to the smallest five random numbers; they 
are our simple random sample. Note that the random numbers shown in Figure 7.2 are in 
ascending order, and that the teams are not in their original order. For instance, Washington 
is the last team listed in Figure 7.1, but it is the first team selected in the simple random 
sample. Miami, the second team in our sample, is the seventh team in the original list, and 
so on. 

¹In order to show the random numbers from Figure 7.1 in ascending order in this worksheet, we turned off the 
automatic recalculation option prior to sorting for illustrative purposes. If the recalculation option were not turned off, a 
new set of random numbers would have been generated when the sort was completed. 


FIGURE 7.2 Using Excel's Sort Procedure to Select the Simple Random 
Sample of Five Teams 

Row   A (Team)         B (Random Numbers) 
1     Team             Random Numbers 
2     Washington       0.066942 
3     Miami            0.179123 
4     San Francisco    0.327713 
5     St. Louis        0.374168 
6     New York         0.471490 
7     Philadelphia     0.523103 
8     Milwaukee        0.525636 
9     Colorado         0.553815 
10    Cincinnati       0.614784 
11    Atlanta          0.706245 
12    Chicago          0.724789 
13    San Diego        0.806185 
14    Arizona          0.850862 
15    Pittsburgh       0.851552 
16    Los Angeles      0.857324 


We now use this simple random sampling procedure to select a simple random sample 
of 30 EAI employees from the population of 2500 EAI employees. We begin by generating 
2500 random numbers, one for each employee in the population. Then we select the 
30 employees corresponding to the 30 smallest random numbers as our sample. Refer to 
Figure 7.3 as we describe the steps involved. 

The Excel Sort procedure for identifying the employees associated with the 30 smallest 
random numbers is especially valuable with such a large population. 


Enter/Access Data: Open the file EAI. The first three columns of the worksheet in the 
background show the annual salary data and training program status for the first 30 
employees in the population of 2500 EAI employees. (The complete worksheet contains all 
2500 employees.) 


Enter Functions and Formulas: In the background worksheet, the label Random 
Numbers has been entered in cell D1 and the formula =RAND() has been entered in 
cells D2:D2501 to generate a random number between 0 and 1 for each of the 2500 EAI 
employees. The random number generated for the first employee is 0.613872, the random 
number generated for the second employee is 0.473204, and so on. 


Apply Tools: All that remains is to find the employees associated with the 30 smallest 
random numbers. To do so, we sort the data in columns A through D into ascending order 
by the random numbers in column D. 


Step 1. Select any cell in the range D2:D2501 
Step 2. Click the Home tab on the Ribbon 

Step 3. In the Editing group, click Sort & Filter 
Step 4. Choose Sort Smallest to Largest 


After completing these steps we obtain the worksheet shown in the foreground of 
Figure 7.3. The employees listed in rows 2-31 are the ones corresponding to the smallest 
30 random numbers that were generated. Hence, this group of 30 employees is a simple random 
sample. Note that the random numbers shown in the foreground of Figure 7.3 are in ascending 
order, and that the employees are not in their original order. For instance, employee 812 in the 
population is associated with the smallest random number and is the first element in the sample,

FIGURE 7.3 Using Excel to Select a Simple Random Sample

Background worksheet (before sorting). Cells marked ... are not legible in the source.

Row   A (Manager)   B (Annual Salary)   C (Training Program)   D (Random Numbers)
2     1             75769.50            No                     0.613872
3     2             70823.00            Yes                    0.473204
4     3             68408.20            No                     0.549011
5     ...           ...                 ...                    ...
6     5             72801.60            Yes                    0.531085
7     6             71767.70            No                     0.994296
8     7             78346.60            Yes                    0.189065
9     8             66670.20            No                     0.020714
10    9             70246.80            Yes                    0.647318
11    10            71255.00            No                     0.524341
12    11            72546.60            No                     0.764998
13    12            69512.50            Yes                    0.255244
14    13            71753.00            Yes                    0.010923
15    ...           ...                 ...                    ...
16    15            68052.20            No                     0.635675
17    16            64652.50            Yes                    0.177294
18    17            ...                 ...                    0.415097
19    18            65187.80            Yes                    0.883440
20    19            ...                 ...                    0.476824
21    20            73706.30            Yes                    0.101065
22    21            72039.50            Yes                    0.775323
23    22            72973.60            No                     0.011729
24    23            73372.50            No                     0.762026
25    24            ...                 ...                    0.066344
26    25            75738.10            Yes                    0.776766
27    26            ...                 ...                    0.828493
28    27            72386.20            Yes                    0.841532
29    28            71051.60            Yes                    0.899427
30    29            72095.60            Yes                    0.486284
31    30            ...                 ...                    0.264628

Note: Rows 32-2501 are not shown.

Foreground worksheet (after sorting by the random numbers in column D)

Row   A (Manager)   B (Annual Salary)   C (Training Program)   D (Random Numbers)
2     812           69094.30            Yes                    0.000193
3     1411          73263.90            Yes                    0.000484
4     1795          69643.50            Yes                    0.002641
5     2095          69894.90            Yes                    0.002763
6     ...           67621.60            No                     ...
7     744           75924.00            Yes                    0.002977
8     470           69092.30            Yes                    0.003182
9     1606          71404.40            Yes                    0.003448
10    1744          70957.70            Yes                    0.004203
11    179           75109.70            Yes                    0.005293
12    1387          65922.60            Yes                    0.005709
13    1782          77268.40            No                     0.005729
14    1006          75688.80            Yes                    0.005796
15    278           71564.70            No                     0.005966
16    1850          76188.20            No                     0.006250
17    844           71766.00            Yes                    0.006708
18    2028          72541.30            No                     0.007767
19    1654          64980.00            Yes                    0.008095
20    444           71932.60            Yes                    0.009686
21    556           72973.00            Yes                    0.009711
22    2449          65120.90            Yes                    0.010595
23    13            71753.00            Yes                    0.010923
24    2187          74391.80            No                     0.011364
25    1633          70164.20            No                     0.011603
26    22            72973.60            No                     0.011729
27    1530          70241.30            No                     0.013570
28    820           72793.90            No                     0.013669
29    1258          70979.40            Yes                    0.014042
30    2349          75860.90            Yes                    0.014532
31    1698          77309.10            No                     0.014539


and employee 13 in the population (see row 14 of the background worksheet) has been included 
as the 22nd observation in the sample (row 23 of the foreground worksheet). 
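
The random-number-and-sort procedure described above can also be sketched outside Excel. The following Python sketch is illustrative only: it assumes the employees are identified by the integers 1 to 2500, uses Python's random module in place of =RAND(), and the function name select_simple_random_sample and the seed value are hypothetical choices made for this example.

```python
import random


def select_simple_random_sample(population_ids, n, seed=None):
    """Pair each element with a uniform random number in [0, 1), sort the
    pairs in ascending order of the random number, and keep the first n
    elements (the analogue of Excel's RAND() followed by Sort)."""
    rng = random.Random(seed)
    keyed = [(rng.random(), element_id) for element_id in population_ids]
    keyed.sort()  # ascending by random number, like Sort Smallest to Largest
    return [element_id for _, element_id in keyed[:n]]


# Assumed labelling: employees are numbered 1 through 2500.
employee_ids = range(1, 2501)
sample = select_simple_random_sample(employee_ids, n=30, seed=7)
print(sample)
```

Because every ordering of the random numbers is equally likely, each group of 30 employees has the same chance of occupying the first 30 sorted positions, which is the defining property of a simple random sample.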


Sampling from an Infinite Population 


Sometimes we want to select a sample from a population, but the population is infinitely
large or the elements of the population are being generated by an ongoing process for
which there is no limit on the number of elements that can be generated. Thus, it is not
possible to develop a list of all the elements in the population. This is considered the infinite
population case. With an infinite population, we cannot select a simple random sample
because we cannot construct a frame consisting of all the elements. In the infinite population
case, statisticians recommend selecting what is called a random sample.


RANDOM SAMPLE (INFINITE POPULATION) 


A random sample of size n from an infinite population is a sample selected such that 
the following conditions are satisfied. 


1. Each element selected comes from the same population. 
2. Each element is selected independently. 


Care and judgment must be exercised in implementing the selection process for 
obtaining a random sample from an infinite population. Each case may require a different 
selection procedure. Let us consider two examples to see what we mean by the conditions: 
(1) Each element selected comes from the same population and (2) each element is selected 
independently. 

A common quality control application involves a production process where there is 
no limit on the number of elements that can be produced. The conceptual population we 
are sampling from is all the elements that could be produced (not just the ones that are 
produced) by the ongoing production process. Because we cannot develop a list of all the 
elements that could be produced, the population is considered infinite. To be more spe- 
cific, let us consider a production line designed to fill boxes of a breakfast cereal with a 
mean weight of 24 ounces of breakfast cereal per box. Samples of 12 boxes filled by this 
process are periodically selected by a quality control inspector to determine if the process 
is operating properly or if, perhaps, a machine malfunction has caused the process to begin
underfilling or overfilling the boxes. 

With a production operation such as this, the biggest concern in selecting a random 
sample is to make sure that condition 1 (the sampled elements are selected from the same 
population) is satisfied. To ensure that this condition is satisfied, the boxes must be selected 
at approximately the same point in time. This way the inspector avoids the possibility of 
selecting some boxes when the process is operating properly and other boxes when the pro- 
cess is not operating properly and is underfilling or overfilling the boxes. With a production 
process such as this, the second condition (each element is selected independently) is satis- 
fied by designing the production process so that each box of cereal is filled independently. 
With this assumption, the quality control inspector only needs to worry about satisfying the 
same population condition. 

As another example of selecting a random sample from an infinite population, consider 
the population of customers arriving at a fast-food restaurant. Suppose an employee is asked 
to select and interview a sample of customers in order to develop a profile of customers who 
visit the restaurant. The customer arrival process is ongoing and there is no way to obtain a 
list of all customers in the population. So, for practical purposes, the population for this 
ongoing process is considered infinite. As long as a sampling procedure is designed so 
that all the elements in the sample are customers of the restaurant and they are selected 
independently, a random sample will be obtained. In this case, the employee collecting 
the sample needs to select the sample from people who come into the restaurant and make 
a purchase to ensure that the same population condition is satisfied. If, for instance, the 
employee selected someone for the sample who came into the restaurant just to use the 
restroom, that person would not be a customer and the same population condition would be 
violated. So, as long as the interviewer selects the sample from people making a purchase 
at the restaurant, condition 1 is satisfied. Ensuring that the customers are selected
independently can be more difficult.

The purpose of the second condition of the random sample selection procedure (each
element is selected independently) is to prevent selection bias. In this case, selection bias
would occur if the interviewer were free to select customers for the sample arbitrarily. The
interviewer might feel more comfortable selecting customers in a particular age group and 
might avoid customers in other age groups. Selection bias would also occur if the inter- 
viewer selected a group of five customers who entered the restaurant together and asked all 
of them to participate in the sample. Such a group of customers would be likely to exhibit 
similar characteristics, which might provide misleading information about the population 
of customers. Selection bias such as this can be avoided by ensuring that the selection of a 
particular customer does not influence the selection of any other customer. In other words, 
the elements (customers) are selected independently. 

McDonald’s, the fast-food restaurant leader, implemented a random sampling proce- 
dure for this situation. The sampling procedure was based on the fact that some customers 
presented discount coupons. Whenever a customer presented a discount coupon, the next 
customer served was asked to complete a customer profile questionnaire. Because arriving 
customers presented discount coupons randomly and independently of other customers, 
this sampling procedure ensured that customers were selected independently. As a result, 
the sample satisfied the requirements of a random sample from an infinite population. 
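
The coupon-triggered rule described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a description of the restaurant's actual system: the arrival stream is simulated, each customer is assumed to present a coupon independently with a hypothetical probability of 0.10, and the function name coupon_triggered_sample is invented for this example.

```python
import random


def coupon_triggered_sample(num_customers, coupon_probability, seed=None):
    """Simulate a stream of customers. Whenever customer i presents a
    coupon, customer i + 1 (the next customer served) is selected."""
    rng = random.Random(seed)
    presents_coupon = [rng.random() < coupon_probability
                       for _ in range(num_customers)]
    # The last customer has no successor, so it cannot trigger a selection.
    selected = [i + 1 for i in range(num_customers - 1) if presents_coupon[i]]
    return presents_coupon, selected


# Hypothetical values: 1000 arrivals, 10% coupon rate.
presents_coupon, selected = coupon_triggered_sample(1000, 0.10, seed=1)
print(len(selected), "customers selected; first few positions:", selected[:5])
```

The interviewer has no discretion in this scheme: who is selected depends only on whether the preceding customer happened to present a coupon, which is the independence the second condition requires.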

Situations involving sampling from an infinite population are usually associated with a 
process that operates over time. Examples include parts being manufactured on a produc- 
tion line, repeated experimental trials in a laboratory, transactions occurring at a bank, 
telephone calls arriving at a technical support center, and customers entering a retail store. 
In each case, the situation may be viewed as a process that generates elements from an 
infinite population. As long as the sampled elements are selected from the same popula- 
tion and are selected independently, the sample is considered a random sample from an 
infinite population. 


NOTES + COMMENTS

1. In this section we have been careful to define two types of samples: a simple random
sample from a finite population and a random sample from an infinite population. In
the remainder of the text, we will generally refer to both of these as either a random
sample or simply a sample. We will not make a distinction of the sample being a
“simple” random sample unless it is necessary for the exercise or discussion.

2. Statisticians who specialize in sample surveys from finite populations use sampling
methods that provide probability samples. With a probability sample, each possible
sample has a known probability of selection and a random process is used to select
the elements for the sample. Simple random sampling is one of these methods. In
Section 7.7, we describe some other probability sampling methods: stratified random
sampling, cluster sampling, and systematic sampling. We use the term simple in simple
random sampling to clarify that this is the probability sampling method that assures
each sample of size n has the same probability of being selected.

3. The number of different simple random samples of size n that can be selected from a
finite population of size N is

$$\frac{N!}{n!(N - n)!}$$

In this formula, N! and n! are the factorial formulas discussed in Chapter 4. For the EAI
problem with N = 2500 and n = 30, this expression can be used to show that
approximately 2.75 × 10^69 different simple random samples of 30 EAI employees can
be obtained.


EXERCISES


Methods

1. Consider a finite population with five elements labeled A, B, C, D, and E. Ten possible
simple random samples of size 2 can be selected.
a. List the 10 samples beginning with AB, AC, and so on.
b. Using simple random sampling, what is the probability that each sample of size 2 is
selected?
c. Suppose we use Excel’s RAND function to assign random numbers to the
five elements: A (.7266), B (.0476), C (.2459), D (.0957), E (.9408). List the
simple random sample of size 2 that will be selected by using these random
numbers.
2. Assume a finite population has 10 elements. Number the elements from 1 to 10 and 
use the following 10 random numbers to select a sample of size 4. 


.7545 .0936 .0341 .3242 .1449 .9060 .2420 .9773 .5428 .0729


Applications 
3. Randomly Sampling American League Teams. The 2019 American League 
consists of 15 baseball teams. Suppose a sample of 5 teams is to be selected to 
conduct player interviews. The following table lists the 15 teams and the random 
numbers assigned by Excel’s RAND function. Use these random numbers to select 
a sample of size 5. 


Team        Random Number   Team          Random Number   Team      Random Number
Baltimore   0.578370        Chicago       0.562178        Anaheim   0.895267
Boston      0.290197        Cleveland     0.960271        Houston   0.657999
New York    0.178624        Detroit       0.253574        Oakland   0.288287
Tampa Bay   0.867778        Kansas City   0.326836        Seattle   0.839071
Toronto     0.965807        Minnesota     0.811810        Texas     0.500879

DATA file: AmericanLeague


4. Randomly Sampling PGA Golfers. The U.S. Golf Association has instituted a ban 
on long and belly putters. This has caused a great deal of controversy among both 
amateur golfers and members of the Professional Golf Association (PGA). Shown 
below are the names of the top 10 finishers in the recent PGA Tour RSM Classic 
golf tournament. (Note that two golfers tied for fourth place and four golfers tied for 
seventh place.) 


1. Charles Howell III 6. Cameron Champ 
2. Patrick Rodgers 7. Zach Johnson 
3. Webb Simpson 8. Peter Uihlein 
4. Luke List 9. Chase Wright 
5. Ryan Blaum 10. Kevin Kisner 


Select a simple random sample of 3 of these players to assess their opinions on the use 
of long and belly putters. 


5. Randomly Sampling EAI Employees. In this section we used a two-step procedure
to select a simple random sample of 30 EAI employees. Use this procedure to select a
simple random sample of 50 EAI employees. (DATA file: EAI)


6. Finite vs. Infinite Populations. Indicate which of the following situations involve 
sampling from a finite population and which involve sampling from an infinite 
population. In cases where the sampled population is finite, describe how you would 
construct a frame. 

a. Select a sample of licensed drivers in the state of New York. 
b. Select a sample of boxes of cereal off the production line for the Breakfast Choice 
Company. 

c. Select a sample of cars crossing the Golden Gate Bridge on a typical weekday.

d. Select a sample of students in a statistics course at Indiana University. 

e. Select a sample of the orders being processed by a mail-order firm. 


7.3 Point Estimation


Now that we have described how to select a simple random sample, let us return to the EAI
problem. A simple random sample of 30 employees and the corresponding data on annual
salary and management training program participation are as shown in Table 7.2. The
notation x₁, x₂, and so on is used to denote the annual salary of the first employee in the sample,
the annual salary of the second employee in the sample, and so on. Participation in the
management training program is indicated by Yes in the management training program column.

To estimate the value of a population parameter, we compute a corresponding characteristic
of the sample, referred to as a sample statistic. For example, to estimate the population
mean μ and the population standard deviation σ for the annual salary of EAI employees,
we use the data in Table 7.2 to calculate the corresponding sample statistics: the
sample mean x̄ and the sample standard deviation s. Using the formulas for a sample mean
and a sample standard deviation presented in Chapter 3, the sample mean is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{2{,}154{,}420}{30} = \$71{,}814$$

and the sample standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{325{,}009{,}260}{29}} = \$3348$$

To estimate p, the proportion of employees in the population who completed the management
training program, we use the corresponding sample proportion p̄. Let x denote
the number of employees in the sample who completed the management training program.
The data in Table 7.2 show that x = 19. Thus, with a sample size of n = 30, the sample
proportion is

$$\bar{p} = \frac{x}{n} = \frac{19}{30} = .63$$


By making the preceding computations, we perform the statistical procedure called
point estimation. We refer to the sample mean x̄ as the point estimator of the population
mean μ, the sample standard deviation s as the point estimator of the population standard


TABLE 7.2 Annual Salary and Training Program Status for a Simple Random Sample of 30 EAI Managers

Annual Salary ($)   Management Training Program   Annual Salary ($)   Management Training Program
x₁ = 69,094.30      Yes                           x₁₆ = 71,766.00     Yes
x₂ = 73,263.90      Yes                           x₁₇ = 72,541.30     No
x₃ = 69,643.50      Yes                           x₁₈ = 64,980.00     Yes
x₄ = 69,894.90      Yes                           x₁₉ = 71,932.60     Yes
x₅ = 67,621.60      No                            x₂₀ = 72,973.00     Yes
x₆ = 75,924.00      Yes                           x₂₁ = 65,120.90     Yes
x₇ = 69,092.30      Yes                           x₂₂ = 71,753.00     Yes
x₈ = 71,404.40      Yes                           x₂₃ = 74,391.80     No
x₉ = 70,957.70      Yes                           x₂₄ = 70,164.20     No
x₁₀ = 75,109.70     Yes                           x₂₅ = 72,973.60     No
x₁₁ = 65,922.60     Yes                           x₂₆ = 70,241.30     No
x₁₂ = 77,268.40     No                            x₂₇ = 72,793.90     No
x₁₃ = 75,688.80     Yes                           x₂₈ = 70,979.40     Yes
x₁₄ = 71,564.70     No                            x₂₉ = 75,860.90     Yes
x₁₅ = 76,188.20     No                            x₃₀ = 77,309.10     No

In Chapter 8, we discuss how to construct an interval estimate in order to provide
information about how close the point estimate is to the population parameter.

TABLE 7.3 Summary of Point Estimates Obtained from a Simple Random Sample of 30 EAI Employees

Population Parameter                          Parameter Value   Point Estimator                           Point Estimate
μ = Population mean annual salary             $71,800           x̄ = Sample mean annual salary             $71,814
σ = Population standard deviation for         $4000             s = Sample standard deviation for         $3348
    annual salary                                                   annual salary
p = Population proportion having completed    .60               p̄ = Sample proportion having completed    .63
    the management training program                                 the management training program


deviation σ, and the sample proportion p̄ as the point estimator of the population proportion
p. The numerical value obtained for x̄, s, or p̄ is called the point estimate. Thus, for
the simple random sample of 30 EAI employees shown in Table 7.2, $71,814 is the point
estimate of μ, $3348 is the point estimate of σ, and .63 is the point estimate of p. Table 7.3
summarizes the sample results and compares the point estimates to the actual values of the
population parameters.

As is evident from Table 7.3, the point estimates differ somewhat from the correspond- 
ing population parameters. This difference is to be expected because a sample, and not a 
census of the entire population, is being used to develop the point estimates. 
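
For readers who want to check the arithmetic, the three point estimates can be recomputed from the Table 7.2 data. The following Python sketch is only an illustration: the salary and training-program lists are transcribed from Table 7.2, and the variable names are choices made for this example rather than anything defined in the text.

```python
import math

# Annual salaries x_1, ..., x_30 transcribed from Table 7.2.
salaries = [
    69094.30, 73263.90, 69643.50, 69894.90, 67621.60,
    75924.00, 69092.30, 71404.40, 70957.70, 75109.70,
    65922.60, 77268.40, 75688.80, 71564.70, 76188.20,
    71766.00, 72541.30, 64980.00, 71932.60, 72973.00,
    65120.90, 71753.00, 74391.80, 70164.20, 72973.60,
    70241.30, 72793.90, 70979.40, 75860.90, 77309.10,
]
# Management training program status (True = Yes), in the same order.
completed_training = [
    True, True, True, True, False, True, True, True, True, True,
    True, False, True, False, False, True, False, True, True, True,
    True, True, False, False, False, False, False, True, True, False,
]

n = len(salaries)
sample_mean = sum(salaries) / n                        # point estimate of mu
squared_deviations = sum((x - sample_mean) ** 2 for x in salaries)
sample_std = math.sqrt(squared_deviations / (n - 1))   # point estimate of sigma
sample_proportion = sum(completed_training) / n        # point estimate of p

print(f"sample mean       = {sample_mean:,.0f}")
print(f"sample std dev    = {sample_std:,.0f}")
print(f"sample proportion = {sample_proportion:.2f}")
```

The divisor n - 1 in the standard deviation matches the sample formula from Chapter 3; dividing by n instead would give the population formula and a slightly smaller value.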


Practical Advice 


The subject matter of much of the rest of the book is concerned with statistical infer- 
ence. Point estimation is a form of statistical inference. We use a sample statistic to 
make an inference about a population parameter. When making inferences about a 
population based on a sample, it is important to have a close correspondence between 
the sampled population and the target population. The target population is the population 
we want to make inferences about, while the sampled population is the population from 
which the sample is actually taken. In this section, we have described the process of 
drawing a simple random sample from the population of EAI employees and making 
point estimates of characteristics of that same population. So the sampled population 
and the target population are identical, which is the desired situation. But in other 
cases, it is not as easy to obtain a close correspondence between the sampled and 
target populations. 

Consider the case of an amusement park selecting a sample of its customers to learn 
about characteristics such as age and time spent at the park. Suppose all the sample ele- 
ments were selected on a day when park attendance was restricted to employees of a large 
company. Then the sampled population would be composed of employees of that company 
and members of their families. If the target population we wanted to make inferences about 
were typical park customers over a typical summer, then we might encounter a significant 
difference between the sampled population and the target population. In such a case, we 
would question the validity of the point estimates being made. Park management would 
be in the best position to know whether a sample taken on a particular day was likely to be 
representative of the target population. 

In summary, whenever a sample is used to make inferences about a population, we 
should make sure that the study is designed so that the sampled population and the target 
population are in close agreement. Good judgment is a necessary ingredient of sound
statistical practice.

EXERCISES


Methods 
7. The following data are from a simple random sample. 


5 8 10 7 10 14 


a. What is the point estimate of the population mean? 
b. What is the point estimate of the population standard deviation? 

8. A survey question for a sample of 150 individuals yielded 75 Yes responses, 55 No 
responses, and 20 No Opinions. 
a. What is the point estimate of the proportion in the population who respond Yes? 
b. What is the point estimate of the proportion in the population who respond No? 


Applications 
9. Monthly Sales Data. A simple random sample of 5 months of sales data provided the 
following information: 


Month: 1 2 3 4 5 
Units Sold: 94 100 85 94 92 


a. Develop a point estimate of the population mean number of units sold per month. 
b. Develop a point estimate of the population standard deviation. 


10. Morningstar Stock Data. Morningstar publishes ratings data on 1208 company
stocks. A sample of 40 of these stocks is contained in the file Morningstar. Use the
Morningstar data set to answer the following questions.


a. Develop a point estimate of the proportion of the stocks that receive Morningstar’s 
highest rating of 5 Stars. 

b. Develop a point estimate of the proportion of the Morningstar stocks that are rated 
Above Average with respect to business risk. 

c. Develop a point estimate of the proportion of the Morningstar stocks that are rated 
2 Stars or less. 

11. Rating Wines. According to Wine-Searcher, wine critics generally use a wine-scoring 
scale to communicate their opinions on the relative quality of wines. Wine scores 
range from 0 to 100, with a score of 95-100 indicating a great wine, 90-94 indicating
an outstanding wine, 85-89 indicating a very good wine, 80-84 indicating a good
wine, 75-79 indicating a mediocre wine, and below 75 indicating that the wine is not
recommended. Random ratings of a pinot noir recently produced by a newly
established vineyard in 2018 follow:


87 91 86 82 72 91 
60 77 80 79 83 96 


a. Develop a point estimate of mean wine score for this pinot noir. 

b. Develop a point estimate of the standard deviation for wine scores received by this 
pinot noir. 

12. AARP Survey. In a sample of 426 U.S. adults age 50 and older, AARP asked how 
important a variety of issues were in choosing whom to vote for in the most recent 
presidential election. 

a. What is the sampled population for this study? 

b. Social Security and Medicare were cited as “very important” by 350 respondents. 
Estimate the proportion of the population of U.S. adults age 50 and over who 
believe this issue is very important. 

c. Education was cited as “very important” by 74% of the respondents. Estimate the 
number of respondents who believe this issue is very important. 

d. Job Growth was cited as “very important” by 354 respondents. Estimate the propor- 
tion of U.S. adults age 50 and over who believe job growth is very important. 

e. What is the target population for the inferences being made in parts (b) and (d)? Is
it the same as the sampled population you identified in part (a)? Suppose you later
learn that the sample was restricted to members of the AARP. Would you still feel
the inferences being made in parts (b) and (d) are valid? Why or why not?

13. Teen Attitudes on Social Media. In 2018, the Pew Internet & American Life Project 
asked 743 teens aged 13 to 17 several questions about their attitudes toward social 
media. The results showed that 602 say social media makes them feel more connected 
to what is going on in their friends’ lives; 513 say social media helps them interact 
with a more diverse group of people; and 275 feel pressure to post content that will get 
a lot of likes and comments. 

a. Develop a point estimate of the proportion of teens aged 13 to 17 who say social 
media makes them feel more connected to what is going on in their friends’ lives. 

b. Develop a point estimate of the proportion of teens aged 13 to 17 who say social 
media helps them interact with a more diverse group of people. 

c. Develop a point estimate of the proportion of teens aged 13 to 17 who feel pressure 
to post content that will get a lot of likes and comments. 

14. Point Estimates for EAI Employees. In this section we showed how a simple random
sample of 30 EAI employees can be used to develop point estimates of the population
mean annual salary, the population standard deviation for annual salary, and the
population proportion having completed the management training program. (DATA file: EAI)

a. Use Excel to select a simple random sample of 50 EAI employees. 

b. Develop a point estimate of the mean annual salary. 

c. Develop a point estimate of the population standard deviation for annual salary. 

d. Develop a point estimate of the population proportion having completed the man- 
agement training program. 


7.4 Introduction to Sampling Distributions 


In the preceding section we said that the sample mean x̄ is the point estimator of the
population mean μ, and the sample proportion p̄ is the point estimator of the population proportion
p. For the simple random sample of 30 EAI employees shown in Table 7.2, the point
estimate of μ is x̄ = $71,814 and the point estimate of p is p̄ = .63. Suppose we select another
simple random sample of 30 EAI employees and obtain the following point estimates:

Sample mean: x̄ = $72,670
Sample proportion: p̄ = .70

Note that different values of x̄ and p̄ were obtained. Indeed, a second simple random
sample of 30 EAI employees cannot be expected to provide the same point estimates as the
first sample.

Now, suppose we repeat the process of selecting a simple random sample of 30 EAI
employees over and over again, each time computing the values of x̄ and p̄. Table 7.4 contains


TABLE 7.4 Values of x̄ and p̄ from 500 Simple Random Samples of 30 EAI Employees

Sample Number   Sample Mean (x̄)   Sample Proportion (p̄)
1               71,814             .63
2               72,670             .70
3               71,780             .67
4               71,588             .53
...             ...                ...
500             71,752             .50

TABLE 7.5 Frequency and Relative Frequency Distributions of x̄ from 500 Simple Random Samples of 30 EAI Employees

Mean Annual Salary ($)   Frequency   Relative Frequency
69,500.00-69,999.99      2           .004
70,000.00-70,499.99      16          .032
70,500.00-70,999.99      52          .104
71,000.00-71,499.99      101         .202
71,500.00-71,999.99      133         .266
72,000.00-72,499.99      110         .220
72,500.00-72,999.99      54          .108
73,000.00-73,499.99      26          .052
73,500.00-73,999.99      6           .012
Totals                   500         1.000


a portion of the results obtained for 500 simple random samples, and Table 7.5 shows the
frequency and relative frequency distributions for the 500 x̄ values. Figure 7.4 shows the
relative frequency histogram for the x̄ values.

The ability to understand the material in subsequent chapters depends heavily on the
ability to understand and use the sampling distributions presented in this chapter.

FIGURE 7.4 Relative Frequency Histogram of x̄ Values from 500 Simple Random Samples of Size 30 Each

[Histogram: vertical axis, Relative Frequency; horizontal axis, Values of x̄ from 70,000 to 74,000]

In Chapter 5 we defined a random variable as a numerical description of the outcome
of an experiment. If we consider the process of selecting a simple random sample as an
experiment, the sample mean x̄ is the numerical description of the outcome of the
experiment. Thus, the sample mean x̄ is a random variable. As a result, just like other random
variables, x̄ has a mean or expected value, a standard deviation, and a probability distribution.
Because the various possible values of x̄ are the result of different simple random samples,
the probability distribution of x̄ is called the sampling distribution of x̄. Knowledge of this
sampling distribution and its properties will enable us to make probability statements about
how close the sample mean x̄ is to the population mean μ.

Let us return to Figure 7.4. We would need to enumerate every possible sample of
30 employees and compute each sample mean to completely determine the sampling
distribution of x̄. However, the histogram of 500 x̄ values gives an approximation of this
sampling distribution. From the approximation we observe the bell-shaped appearance of
the distribution. We note that the largest concentration of the x̄ values and the mean of the
500 x̄ values is near the population mean μ = $71,800. We will describe the properties of
the sampling distribution of x̄ more fully in the next section.

The 500 values of the sample proportion p̄ are summarized by the relative frequency
histogram in Figure 7.5. As in the case of x̄, p̄ is a random variable. If every possible
sample of size 30 were selected from the population and if a value of p̄ were computed for
each sample, the resulting probability distribution would be the sampling distribution of p̄.
The relative frequency histogram of the 500 sample values in Figure 7.5 provides a general
idea of the appearance of the sampling distribution of p̄.

In practice, we select only one simple random sample from the population. We repeated
the sampling process 500 times in this section simply to illustrate that many different
samples are possible and that the different samples generate a variety of values for the
sample statistics x̄ and p̄. The probability distribution of any particular sample statistic is called
the sampling distribution of the statistic. In Section 7.5 we discuss the characteristics of
the sampling distribution of x̄. In Section 7.6 we discuss the characteristics of the sampling
distribution of p̄.
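
The repeated-sampling experiment described in this section can be imitated with a short simulation. The sketch below is illustrative only and rests on several assumptions: the actual EAI data file is not used, so a hypothetical population of 2500 salaries is generated from a normal distribution with mean 71,800 and standard deviation 4000, exactly 1500 of the 2500 employees (60%) are marked as having completed the training program, and the seed and all names are arbitrary choices.

```python
import random
import statistics

rng = random.Random(2024)

# Hypothetical stand-in for the EAI population (NOT the real data file).
N = 2500
salaries = [rng.gauss(71_800, 4_000) for _ in range(N)]
training = [True] * 1500 + [False] * 1000   # population proportion p = .60
rng.shuffle(training)
population = list(zip(salaries, training))


def draw_estimates(population, n, rng):
    """Select one simple random sample (without replacement) and return
    the point estimates (x-bar, p-bar) computed from it."""
    sample = rng.sample(population, n)
    x_bar = statistics.fmean(salary for salary, _ in sample)
    p_bar = sum(done for _, done in sample) / n
    return x_bar, p_bar


# Repeat the sampling process 500 times, as in Table 7.4.
estimates = [draw_estimates(population, 30, rng) for _ in range(500)]
x_bars = [x for x, _ in estimates]
p_bars = [p for _, p in estimates]

population_mean = statistics.fmean(salaries)
print(f"population mean          = {population_mean:,.0f}")
print(f"mean of 500 x-bar values = {statistics.fmean(x_bars):,.0f}")
print(f"mean of 500 p-bar values = {statistics.fmean(p_bars):.3f}")
```

Individual x̄ and p̄ values vary from sample to sample, but their averages over the 500 samples should land close to the population mean and the population proportion of the simulated population, which mirrors the pattern seen in Table 7.5.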


FIGURE 7.5 Relative Frequency Histogram of p̄ Values from 500 Simple Random Samples of Size 30 Each

[Histogram: vertical axis, Relative Frequency]


7.5 Sampling Distribution of x̄


In the previous section we said that the sample mean x is a random variable and its proba- 
bility distribution is called the sampling distribution of x. 


SAMPLING DISTRIBUTION OF x 


The sampling distribution of x is the probability distribution of all possible values of 
the sample mean x. 


This section describes the properties of the sampling distribution of x. Just as with other 
probability distributions we studied, the sampling distribution of x has an expected value or 
mean, a standard deviation, and a characteristic shape or form. Let us begin by considering 
the mean of all possible x values, which is referred to as the expected value of x. 


Expected Value of x̄


In the EAI sampling problem we saw that different simple random samples result in a
variety of values for the sample mean x̄. Because many different values of the random variable
x̄ are possible, we are often interested in the mean of all possible values of x̄ that can be
generated by the various simple random samples. The mean of the x̄ random variable is the
expected value of x̄. Let E(x̄) represent the expected value of x̄ and μ represent the mean of
the population from which we are selecting a simple random sample. It can be shown that
with simple random sampling, E(x̄) and μ are equal.


EXPECTED VALUE OF x̄


\[ E(\bar{x}) = \mu \tag{7.1} \]

where

E(x̄) = the expected value of x̄
μ = the population mean


The expected value of x̄ equals the mean of the population from which the sample is selected.


This result shows that with simple random sampling, the expected value or mean of the
sampling distribution of x̄ is equal to the mean of the population. In Section 7.1 we saw
that the mean annual salary for the population of EAI employees is μ = $71,800. Thus,
according to equation (7.1), the mean of all possible sample means for the EAI study is
also $71,800.

When the expected value of a point estimator equals the population parameter, we say
the point estimator is unbiased. Thus, equation (7.1) shows that x̄ is an unbiased estimator
of the population mean μ.
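
The unbiasedness result in equation (7.1) can be illustrated with a small simulation. The
following Python sketch is only an illustration under stated assumptions: the EAI salary data
are not reproduced here, so the population is a hypothetical stand-in of 2500 values generated
with a mean near 71,800 and a standard deviation near 4000, and the number of repeated samples
(2000) is an arbitrary choice.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the EAI salary population (the real data are not shown here).
N, n, reps = 2500, 30, 2000
population = [random.gauss(71_800, 4_000) for _ in range(N)]
mu = statistics.fmean(population)

# Draw many simple random samples (without replacement) and record each sample mean.
sample_means = [statistics.fmean(random.sample(population, n)) for _ in range(reps)]

mean_of_means = statistics.fmean(sample_means)
print(f"population mean      : {mu:,.1f}")
print(f"mean of sample means : {mean_of_means:,.1f}")
```

Individual sample means vary from sample to sample, but their average should land close to the
mean of the simulated population, which is what an unbiased estimator implies.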


Standard Deviation of x̄

Let us define the standard deviation of the sampling distribution of x̄. We will use the
following notation.


\(\sigma_{\bar{x}}\) = the standard deviation of x̄
σ = the standard deviation of the population
n = the sample size
N = the population size


It can be shown that the formula for the standard deviation of x̄ depends on whether the
population is finite or infinite. The two formulas for the standard deviation of x̄ follow.


STANDARD DEVIATION OF x̄


Finite Population

\[ \sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right) \]

Infinite Population

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{7.2} \]


Exercise 17 shows that when n/N ≤ .05, the finite population correction factor has little
effect on the value of \(\sigma_{\bar{x}}\).

The term standard error is used throughout statistical inference to refer to the standard
deviation of a point estimator.


In comparing the two formulas in equation (7.2), we see that the factor \(\sqrt{(N - n)/(N - 1)}\)
is required for the finite population case but not for the infinite population case. This factor
is commonly referred to as the finite population correction factor. In many practical
sampling situations, we find that the population involved, although finite, is “large,” whereas
the sample size is relatively “small.” In such cases the finite population correction factor
\(\sqrt{(N - n)/(N - 1)}\) is close to 1. As a result, the difference between the values of the
standard deviation of x̄ for the finite and infinite population cases becomes negligible. Then,
\(\sigma_{\bar{x}} = \sigma/\sqrt{n}\) becomes a good approximation to the standard deviation of x̄
even though the population is finite. This observation leads to the following general guideline,
or rule of thumb, for computing the standard deviation of x̄.


USE THE FOLLOWING EXPRESSION TO COMPUTE THE STANDARD DEVIATION OF x̄


\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{7.3} \]

whenever


1. The population is infinite; or
2. The population is finite and the sample size is less than or equal to 5% of the
population size; that is, n/N ≤ .05.


In cases where n/N > .05, the finite population version of formula (7.2) should be used
in the computation of \(\sigma_{\bar{x}}\). Unless otherwise noted, throughout the text we will
assume that the population size is “large,” n/N ≤ .05, and expression (7.3) can be used to
compute \(\sigma_{\bar{x}}\).

To compute \(\sigma_{\bar{x}}\), we need to know σ, the standard deviation of the population. To
further emphasize the difference between \(\sigma_{\bar{x}}\) and σ, we refer to the standard
deviation of x̄, \(\sigma_{\bar{x}}\), as the standard error of the mean. In general, the term
standard error refers to the standard deviation of a point estimator. Later we will see that the
value of the standard error of the mean is helpful in determining how far the sample mean may be
from the population mean. Let us now return to the EAI example and compute the standard error of
the mean associated with simple random samples of 30 EAI employees.

In Section 7.1 we saw that the standard deviation of annual salary for the population
of 2500 EAI employees is σ = 4000. In this case, the population is finite, with N = 2500.
However, with a sample size of 30, we have n/N = 30/2500 = .012. Because the sample
size is less than 5% of the population size, we can ignore the finite population correction
factor and use equation (7.3) to compute the standard error.


\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{30}} = 730.3 \]
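
The rule of thumb and the EAI calculation can be sketched in a few lines of Python. This is an
illustrative helper rather than anything the text prescribes; the function name
`standard_error_mean` and its optional argument `N` are assumptions made for this example.

```python
import math

def standard_error_mean(sigma, n, N=None):
    """Standard deviation of the sample mean.

    Uses sigma / sqrt(n). If a finite population size N is given and n/N > .05,
    the finite population correction factor sqrt((N - n)/(N - 1)) is applied.
    """
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# EAI example: sigma = 4000, n = 30, N = 2500, so n/N = .012 and no correction is applied.
se_eai = standard_error_mean(4000, 30, N=2500)

# For comparison only: the value the finite population formula in (7.2) would give.
se_eai_fpc = 4000 / math.sqrt(30) * math.sqrt((2500 - 30) / (2500 - 1))

print(round(se_eai, 1), round(se_eai_fpc, 1))
```

The two values differ by less than 1%, which is the sense in which the correction factor can be
ignored when n/N ≤ .05.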
Form of the Sampling Distribution of x̄


The preceding results concerning the expected value and standard deviation for the
sampling distribution of x̄ are applicable for any population. The final step in identifying the
characteristics of the sampling distribution of x̄ is to determine the form or shape of the
sampling distribution. We will consider two cases: (1) The population has a normal
distribution; and (2) the population does not have a normal distribution.


Population Has a Normal Distribution In many situations it is reasonable to assume that
the population from which we are selecting a random sample has a normal, or nearly
normal, distribution. When the population has a normal distribution, the sampling distribution
of x̄ is normally distributed for any sample size.


Population Does Not Have a Normal Distribution When the population from which
we are selecting a random sample does not have a normal distribution, the central limit
theorem is helpful in identifying the shape of the sampling distribution of x̄. A statement
of the central limit theorem as it applies to the sampling distribution of x̄ follows.


CENTRAL LIMIT THEOREM


In selecting random samples of size n from a population, the sampling distribution of
the sample mean x̄ can be approximated by a normal distribution as the sample size
becomes large.


Figure 7.6 shows how the central limit theorem works for three different populations; 
each column refers to one of the populations. The top panel of the figure shows that none 
of the populations are normally distributed. Population I follows a uniform distribution. 
Population II is often called the rabbit-eared distribution. It is symmetric, but the more 
likely values fall in the tails of the distribution. Population III is shaped like the exponential 
distribution; it is skewed to the right. 

The bottom three panels of Figure 7.6 show the shape of the sampling distribution for 
samples of size n = 2, n = 5, and n = 30. When the sample size is 2, we see that the shape 
of each sampling distribution is different from the shape of the corresponding population 
distribution. For samples of size 5, we see that the shapes of the sampling distributions 
for populations I and II begin to look similar to the shape of a normal distribution. Even 
though the shape of the sampling distribution for population III begins to look similar to 
the shape of a normal distribution, some skewness to the right is still present. Finally, for 
samples of size 30, the shapes of each of the three sampling distributions are approxi- 
mately normal. 
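
The behavior described for Population III can be reproduced with a short simulation. The sketch
below is illustrative only: it assumes an exponential population with mean 1 (a stand-in for the
right-skewed population in the figure), uses 5000 repeated samples per sample size, and measures
the remaining asymmetry with a simple skewness statistic, none of which comes from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_means(n, reps=5000):
    """Means of `reps` random samples of size n from an exponential population with mean 1."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

results = {}
for n in (2, 5, 30):
    means = sample_means(n)
    # Store (mean of sample means, standard deviation of sample means, skewness).
    results[n] = (statistics.fmean(means), statistics.pstdev(means), skewness(means))
    print(n, [round(v, 3) for v in results[n]])
```

As n grows from 2 to 30, the skewness of the simulated sample means should shrink toward 0 and
their standard deviation should approach 1/√n, in line with the pattern shown in Figure 7.6.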

From a practitioner standpoint, we often want to know how large the sample size needs
to be before the central limit theorem applies and we can assume that the shape of the
sampling distribution is approximately normal. Statistical researchers have investigated
this question by studying the sampling distribution of x̄ for a variety of populations and a
variety of sample sizes. General statistical practice is to assume that, for most applications,
the sampling distribution of x̄ can be approximated by a normal distribution whenever the
sample size is 30 or more. In cases where the population is highly skewed or outliers are
present, samples of size 50 may be needed. Finally, if the population is discrete, the sample
size needed for a normal approximation often depends on the population proportion. We
say more about this issue when we discuss the sampling distribution of p̄ in Section 7.6.


Sampling Distribution of x̄ for the EAI Problem


Let us return to the EAI problem where we previously showed that E(x̄) = $71,800
and \(\sigma_{\bar{x}}\) = 730.3. At this point, we do not have any information about the population
distribution; it may or may not be normally distributed. If the population has a normal
distribution, the sampling distribution of x̄ is normally distributed. If the population does
not have a normal distribution, the simple random sample of 30 employees and the central
limit theorem enable us to conclude that the sampling distribution of x̄ can be approximated
by a normal distribution. In either case, we are comfortable proceeding with the
conclusion that the sampling distribution of x̄ can be described by the normal distribution
shown in Figure 7.7.


FIGURE 7.6 Illustration of the Central Limit Theorem for Three Populations

[Figure: three columns (Population I, Population II, Population III). The top row shows each
population distribution (horizontal axis: values of x); the remaining rows show the sampling
distribution of x̄ for n = 2, n = 5, and n = 30 (horizontal axis: values of x̄).]


Practical Value of the Sampling Distribution of x̄


Whenever a simple random sample is selected and the value of the sample mean is used to
estimate the value of the population mean μ, we cannot expect the sample mean to exactly
equal the population mean. The practical reason we are interested in the sampling
distribution of x̄ is that it can be used to provide probability information about the difference
between the sample mean and the population mean. To demonstrate this use, let us return to
the EAI problem.


FIGURE 7.7 Sampling Distribution of x̄ for the Mean Annual Salary of a Simple Random Sample of 30 EAI Employees

[Figure: normal curve for the sampling distribution of x̄, centered at E(x̄) = 71,800.]

Suppose the personnel director believes the sample mean will be an acceptable estimate 
of the population mean if the sample mean is within $500 of the population mean. How- 
ever, it is not possible to guarantee that the sample mean will be within $500 of the pop- 
ulation mean. Indeed, Table 7.5 and Figure 7.4 show that some of the 500 sample means 
differed by more than $2000 from the population mean. So we must think of the personnel 
director’s request in probability terms. That is, the personnel director is concerned with the 
following question: What is the probability that the sample mean computed using a simple 
random sample of 30 EAI employees will be within $500 of the population mean? 

Because we have identified the properties of the sampling distribution of x (see 
Figure 7.7), we will use this distribution to answer the probability question. Refer to 
the sampling distribution of x shown again in Figure 7.8. With a population mean of 
$71,800, the personnel director wants to know the probability that x is between $71,300 
and $72,300. This probability is given by the darkly shaded area of the sampling distri- 
bution shown in Figure 7.8. Because the sampling distribution is normally distributed, 
with mean 71,800 and standard error of the mean 730.3, we can use the standard normal 
probability table to find the area or probability. 

We first calculate the z value at the upper endpoint of the interval (72,300) and use the 
table to find the cumulative probability at that point (left tail area). Then we compute the 
z value at the lower endpoint of the interval (71,300) and use the table to find the area un- 
der the curve to the left of that point (another left tail area). Subtracting the second tail area 
from the first gives us the desired probability. 

At x̄ = 72,300, we have

\[ z = \frac{72{,}300 - 71{,}800}{730.30} = .68 \]

Referring to the standard normal probability table, we find a cumulative probability (area
to the left of z = .68) of .7517.

At x̄ = 71,300, we have

\[ z = \frac{71{,}300 - 71{,}800}{730.30} = -.68 \]

The area under the curve to the left of z = −.68 is .2483. Therefore, P(71,300 ≤ x̄ ≤
72,300) = P(z ≤ .68) − P(z < −.68) = .7517 − .2483 = .5034.


FIGURE 7.8 Probability of a Sample Mean Being Within $500 of the Population Mean for a Simple Random Sample of 30 EAI Employees

[Figure: sampling distribution of x̄ with \(\sigma_{\bar{x}}\) = 730.30; the shaded area lies between
71,300 and 72,300.]

The desired probability can also be computed using Excel’s NORM.DIST function.
The advantage of using the NORM.DIST function is that we do not have to make a
separate computation of the z value. Evaluating the NORM.DIST function at the upper
endpoint of the interval provides the cumulative probability at 72,300. Entering the formula
=NORM.DIST(72300,71800,730.30,TRUE) in a cell of an Excel worksheet provides .7532
for this cumulative probability. Evaluating the NORM.DIST function at the lower endpoint
of the interval provides the area under the curve to the left of 71,300. Entering the formula
=NORM.DIST(71300,71800,730.30,TRUE) in a cell of an Excel worksheet provides .2468
for this cumulative probability. The probability of x̄ being in the interval from 71,300 to
72,300 is then given by .7532 − .2468 = .5064. We note that this result is slightly different
from the probability obtained using the table, because in using the normal table we rounded
to two decimal places of accuracy when computing the z value. The result obtained using
NORM.DIST is thus more accurate.

Using Excel’s NORM.DIST function is easier and provides more accurate results than using
the tables with rounded values for z.

The sampling distribution of x̄ can be used to provide probability information about how
close the sample mean x̄ is to the population mean μ.

The preceding computations show that a simple random sample of 30 EAI employees
has a .5064 probability of providing a sample mean x̄ that is within $500 of the population
mean. Thus, there is a 1 − .5064 = .4936 probability that the sampling error will be more
than $500. In other words, a simple random sample of 30 EAI employees has roughly a
50-50 chance of providing a sample mean within the allowable $500. Perhaps a larger
sample size should be considered. Let us explore this possibility by considering the
relationship between the sample size and the sampling distribution of x̄.
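
For readers working outside Excel, the same cumulative probabilities can be obtained from
Python’s standard library. This is a sketch of an equivalent calculation, not part of the
Excel procedure described in the text; `NormalDist.cdf` is used here in the role of NORM.DIST
with its last argument set to TRUE.

```python
from statistics import NormalDist

# Sampling distribution of the sample mean for n = 30 (values taken from the text).
sampling_dist = NormalDist(mu=71_800, sigma=730.30)

upper = sampling_dist.cdf(72_300)  # same role as =NORM.DIST(72300,71800,730.30,TRUE)
lower = sampling_dist.cdf(71_300)  # same role as =NORM.DIST(71300,71800,730.30,TRUE)
prob = upper - lower               # P(71,300 <= sample mean <= 72,300)

print(round(upper, 4), round(lower, 4), round(prob, 4))
```

Because no z value is rounded along the way, the result should agree with the NORM.DIST figure
of about .5064 rather than the table-based .5034.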


Relationship Between the Sample Size 
and the Sampling Distribution of x 


Suppose that in the EAI sampling problem we select a simple random sample of 100 EAI
employees instead of the 30 originally considered. Intuitively, it would seem that with more
data provided by the larger sample size, the sample mean based on n = 100 should provide
a better estimate of the population mean than the sample mean based on n = 30. To see
how much better, let us consider the relationship between the sample size and the sampling
distribution of x̄.


FIGURE 7.9 A Comparison of the Sampling Distributions of x̄ for Simple Random Samples of n = 30 and n = 100 EAI Employees

[Figure: two normal curves centered at 71,800; with n = 100, \(\sigma_{\bar{x}}\) = 400; with n = 30,
\(\sigma_{\bar{x}}\) = 730.3.]


First note that E(x̄) = μ regardless of the sample size. Thus, the mean of all possible
values of x̄ is equal to the population mean μ regardless of the sample size n. However, note
that the standard error of the mean, \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\), is related to the square
root of the sample size. Whenever the sample size is increased, the standard error of the mean
\(\sigma_{\bar{x}}\) decreases. With n = 30, the standard error of the mean for the EAI problem is
730.3. However, with the increase in the sample size to n = 100, the standard error of the mean
is decreased to


\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{100}} = 400 \]


The sampling distributions of x̄ with n = 30 and n = 100 are shown in Figure 7.9.
Because the sampling distribution with n = 100 has a smaller standard error, the values of x̄
have less variation and tend to be closer to the population mean than the values of x̄ with
n = 30.

We can use the sampling distribution of x̄ for the case with n = 100 to compute the
probability that a simple random sample of 100 EAI employees will provide a sample mean that is
within $500 of the population mean. In this case the sampling distribution is normal with a
mean of 71,800 and a standard deviation of 400 (see Figure 7.10). Again, we could compute the
appropriate z values and use the standard normal probability distribution table to make this
probability calculation. However, Excel’s NORM.DIST function is easier to use and provides more
accurate results. Entering the formula =NORM.DIST(72300,71800,400,TRUE) in a cell of an
Excel worksheet provides the cumulative probability corresponding to x̄ = 72,300. The value
provided by Excel is .8944. Entering the formula =NORM.DIST(71300,71800,400,TRUE) in
a cell of an Excel worksheet provides the cumulative probability corresponding to x̄ = 71,300.
The value provided by Excel is .1056. Thus, the probability of x̄ being in the interval from
71,300 to 72,300 is given by .8944 − .1056 = .7888. By increasing the sample size from 30 to
100 EAI employees, we increase the probability that the sampling error will be $500 or less;
that is, the probability of obtaining a sample mean within $500 of the population mean increases
from .5064 to .7888.
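
The effect of the sample size can be tabulated directly. The Python sketch below is an
illustration, not part of the text’s Excel procedure: the helper name `prob_within` is an
assumption, and the sample sizes 250 and 400 are arbitrary values added to extend the comparison
beyond the two cases worked above.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(distance, mu, sigma, n):
    """P(mu - distance <= sample mean <= mu + distance) for a normal sampling distribution."""
    sampling_dist = NormalDist(mu, sigma / sqrt(n))  # standard error is sigma / sqrt(n)
    return sampling_dist.cdf(mu + distance) - sampling_dist.cdf(mu - distance)

# EAI values from the text: mu = 71,800 and sigma = 4000.
# n = 30 and n = 100 are the cases in the text; 250 and 400 are illustrative additions.
probs = {n: prob_within(500, 71_800, 4_000, n) for n in (30, 100, 250, 400)}

for n, p in probs.items():
    print(f"n = {n:3d}   standard error = {4_000 / sqrt(n):6.1f}   P(within $500) = {p:.4f}")
```

Carrying full precision gives about .7887 for n = 100; the .7888 reported above comes from
subtracting the two cumulative probabilities after each was rounded to four decimal places.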

The important point in this discussion is that as the sample size increases, the
standard error of the mean decreases. As a result, a larger sample size will provide a
higher probability that the sample mean falls within a specified distance of the
population mean.


FIGURE 7.10 Probability of a Sample Mean Being Within $500 of the Population Mean for a Simple Random Sample of 100 EAI Employees

[Figure: sampling distribution of x̄; shaded area P(71,300 ≤ x̄ ≤ 72,300) = .7888.]


NOTE + COMMENT


In presenting the sampling distribution of x̄ for the EAI problem, we took advantage of the
fact that the population mean μ = 71,800 and the population standard deviation σ = 4000
were known. However, usually the values of the population mean μ and the population
standard deviation σ that are needed to determine the sampling distribution of x̄ will be
unknown. In Chapter 8 we show how the sample mean x̄ and the sample standard deviation s
are used when μ and σ are unknown.


EXERCISES 


Methods 

15. A population has a mean of 200 and a standard deviation of 50. Suppose a simple
random sample of size 100 is selected and x̄ is used to estimate μ.

a. What is the probability that the sample mean will be within ±5 of the population
mean?

b. What is the probability that the sample mean will be within ±10 of the population
mean?

16. Assume the population standard deviation is σ = 25. Compute the standard error of
the mean, \(\sigma_{\bar{x}}\), for sample sizes of 50, 100, 150, and 200. What can you say about the
size of the standard error of the mean as the sample size is increased?

17. Suppose a random sample of size 50 is selected from a population with σ = 10. Find
the value of the standard error of the mean in each of the following cases (use the
finite population correction factor if appropriate).

a. The population size is infinite.

b. The population size is N = 50,000.

c. The population size is N = 5000.

d. The population size is N = 500.


Applications 
18. Sampling Distribution for Electronic Associates, Inc., Managers. Refer to the EAI 
sampling problem. Suppose a simple random sample of 60 employees is used. 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




a. Sketch the sampling distribution of x when simple random samples of size 60 
are used. 

b. What happens to the sampling distribution of x if simple random samples of size 
120 are used? 

c. What general statement can you make about what happens to the sampling 
distribution of x as the sample size is increased? Does this generalization seem 
logical? Explain. 

19. Finding Probabilities for Electronic Associates, Inc., Managers. In the EAI
sampling problem (see Figure 7.8), we showed that for n = 30, there was a .5064 probability
of obtaining a sample mean within ±$500 of the population mean.

a. What is the probability that x̄ is within $500 of the population mean if a sample of
size 60 is used?

b. Answer part (a) for a sample of size 120. 

20. U.S. Unemployment. Barron’s reported that the average number of weeks an individ- 
ual is unemployed is 17.5 weeks. Assume that for the population of all unemployed 
individuals the population mean length of unemployment is 17.5 weeks and that the 
population standard deviation is 4 weeks. Suppose you would like to select a random 
sample of 50 unemployed individuals for a follow-up study. 

a. Show the sampling distribution of x̄, the sample mean for a sample of
50 unemployed individuals.

b. What is the probability that a simple random sample of 50 unemployed individuals
will provide a sample mean within 1 week of the population mean?

c. What is the probability that a simple random sample of 50 unemployed individuals
will provide a sample mean within 1/2 week of the population mean?

21. SAT Scores. In May 2018, The College Board reported the following mean scores for 
two parts of the Scholastic Aptitude Test (SAT): 


Evidence-Based Reading and Writing 533 
Mathematics 527 


Assume that the population standard deviation on each part of the test is σ = 100.

a. What is the probability a sample of 90 test takers will provide a sample mean test 
score within 10 points of the population mean of 533 on the Evidence-Based Read- 
ing and Writing part of the test? 

b. What is the probability a sample of 90 test takers will provide a sample mean test 
score within 10 points of the population mean of 527 on the Mathematics part of 
the test? 

c. Comment on the differences between the values computed in parts (a) and (b). 

22. Federal Income Tax Returns. The Wall Street Journal reported that 33% of taxpayers 
with adjusted gross incomes between $30,000 and $60,000 itemized deductions on 
their federal income tax return. The mean amount of deductions for this population of 
taxpayers was $16,642. Assume the standard deviation is σ = $2400.

a. What is the probability that a sample of taxpayers from this income group who 
have itemized deductions will show a sample mean within $200 of the population 
mean for each of the following sample sizes: 30, 50, 100, and 400? 

b. What is the advantage of a larger sample size when attempting to estimate the 
population mean? 

23. College Graduate-Level Wages. The Economic Policy Institute periodically issues 
reports on worker’s wages. The institute reported that mean wages for male college 
graduates were $37.39 per hour and for female college graduates were $27.83 per hour 
in 2017. Assume the standard deviation for male graduates is $4.60, and for female 
graduates it is $4.10. 

a. What is the probability that a sample of 50 male graduates will provide a sample
mean within $1.00 of the population mean, $37.39?

b. What is the probability that a sample of 50 female graduates will provide a sample
mean within $1.00 of the population mean, $27.83?

c. In which of the preceding two cases, part (a) or part (b), do we have a higher
probability of obtaining a sample estimate within $1.00 of the population mean?
Why?

d. What is the probability that a sample of 120 female graduates will provide a sample
mean more than $.60 below the population mean, $27.83?

24. State Rainfalls. According to the Current Results website, the state of California has 
a mean annual rainfall of 22 inches, whereas the state of New York has a mean annual 
rainfall of 42 inches. Assume that the standard deviation for both states is 4 inches. A 
sample of 30 years of rainfall for California and a sample of 45 years of rainfall for 
New York have been taken. 

a. Show the probability distribution of the sample mean annual rainfall for California. 

b. What is the probability that the sample mean is within 1 inch of the population 
mean for California? 

c. What is the probability that the sample mean is within 1 inch of the population
mean for New York?

d. In which case, part (b) or part (c), is the probability of obtaining a sample mean
within 1 inch of the population mean greater? Why?

25. Income Tax Return Preparation Fees. The CPA Practice Advisor reports that the 
mean preparation fee for 2017 federal income tax returns was $273. Use this price as 
the population mean and assume the population standard deviation of preparation fees 
is $100. 

a. What is the probability that the mean price for a sample of 30 federal income tax 
returns is within $16 of the population mean? 

b. What is the probability that the mean price for a sample of 50 federal income tax 
returns is within $16 of the population mean? 

c. What is the probability that the mean price for a sample of 100 federal income tax 
returns is within $16 of the population mean? 

d. Which, if any, of the sample sizes in parts (a), (b), and (c) would you recommend to 
ensure at least a .95 probability that the sample mean is within $16 of the popula- 
tion mean? 

26. Employee Ages. To estimate the mean age for a population of 4000 employees, a 
simple random sample of 40 employees is selected. 

a. Would you use the finite population correction factor in calculating the standard 
error of the mean? Explain. 

b. If the population standard deviation is σ = 8.2 years, compute the standard error
both with and without the finite population correction factor. What is the rationale
for ignoring the finite population correction factor whenever n/N ≤ .05?

c. What is the probability that the sample mean age of the employees will be within
±2 years of the population mean age?


7.6 Sampling Distribution of p̄


The sample proportion p̄ is the point estimator of the population proportion p. The formula
for computing the sample proportion is

\[ \bar{p} = \frac{x}{n} \]

where


x = the number of elements in the sample that possess the characteristic of interest
n = sample size


As noted in Section 7.4, the sample proportion p̄ is a random variable and its probability
distribution is called the sampling distribution of p̄.


SAMPLING DISTRIBUTION OF p̄


The sampling distribution of p̄ is the probability distribution of all possible values of
the sample proportion p̄.


To determine how close the sample proportion p̄ is to the population proportion p, we
need to understand the properties of the sampling distribution of p̄: the expected value of
p̄, the standard deviation of p̄, and the shape or form of the sampling distribution of p̄.


Expected Value of p̄


The expected value of p̄, the mean of all possible values of p̄, is equal to the population
proportion p.


EXPECTED VALUE OF p̄


\[ E(\bar{p}) = p \tag{7.4} \]


where

E(p̄) = the expected value of p̄
p = the population proportion


Because E(p̄) = p, p̄ is an unbiased estimator of p. Recall from Section 7.1 we noted that
p = .60 for the EAI population, where p is the proportion of the population of employees
who participated in the company’s management training program. Thus, the expected value
of p̄ for the EAI sampling problem is .60.


Standard Deviation of p̄


Just as we found for the standard deviation of x̄, the standard deviation of p̄ depends on
whether the population is finite or infinite. The two formulas for computing the standard
deviation of p̄ follow.


STANDARD DEVIATION OF p̄


Finite Population

\[ \sigma_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{p(1 - p)}{n}} \]

Infinite Population

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \tag{7.5} \]


Comparing the two formulas in equation (7.5), we see that the only difference is the use of
the finite population correction factor \(\sqrt{(N - n)/(N - 1)}\).

As was the case with the sample mean x̄, the difference between the expressions for the
finite population and the infinite population becomes negligible if the size of the finite
population is large in comparison to the sample size. We follow the same rule of thumb that we
recommended for the sample mean. That is, if the population is finite with n/N ≤ .05, we
will use \(\sigma_{\bar{p}} = \sqrt{p(1 - p)/n}\). However, if the population is finite with
n/N > .05, the finite population correction factor should be used. Again, unless specifically
noted, throughout the text we will assume that the population size is large in relation to the
sample size and thus the finite population correction factor is unnecessary.

In Section 7.5 we used the term standard error of the mean to refer to the standard
deviation of x̄. We stated that in general the term standard error refers to the standard
deviation of a point estimator. Thus, for proportions we use standard error of the
proportion to refer to the standard deviation of p̄. Let us now return to the EAI example and
compute the standard error of the proportion associated with simple random samples of
30 EAI employees.

For the EAI study we know that the population proportion of employees who participated
in the management training program is p = .60. With n/N = 30/2500 = .012, we can
ignore the finite population correction factor when we compute the standard error of the
proportion. For the simple random sample of 30 employees, \(\sigma_{\bar{p}}\) is


\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{.60(1 - .60)}{30}} = .0894 \]

Form of the Sampling Distribution of p 


Now that we know the mean and standard deviation of the sampling distribution of p, 
the final step is to determine the form or shape of the sampling distribution. The sample 
proportion is p = x/n. For a simple random sample from a large population, the value of 
x is a binomial random variable indicating the number of elements in the sample with the 
characteristic of interest. Because n is a constant, the probability of x/n is the same as 
the binomial probability of x, which means that the sampling distribution of p is also a 
discrete probability distribution and that the probability for each value of x/n is the same as 
the probability of x. 

Statisticians have shown that a binomial distribution can be approximated by a 
normal distribution whenever the sample size is large enough to satisfy the following 
two conditions: 


np2=5S and n(l-p)=5 


Assuming these two conditions are satisfied, the probability distribution of x in the sample 
proportion, p = x/n, can be approximated by a normal distribution. And because n is a 
constant, the sampling distribution of p can also be approximated by a normal distribution. 
This approximation is stated as follows: 


The sampling distribution of p can be approximated by a normal distribution 
whenever np = 5 andn — p) =5. 


In practical applications, when an estimate of a population proportion is desired, we find
that sample sizes are almost always large enough to permit the use of a normal approximation
for the sampling distribution of $\bar{p}$.

Recall that for the EAI sampling problem we know that the population proportion of
employees who participated in the training program is p = .60. With a simple random
sample of size 30, we have np = 30(.60) = 18 and n(1 − p) = 30(.40) = 12. Thus,
the sampling distribution of $\bar{p}$ can be approximated by the normal distribution shown
in Figure 7.11.
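
The check just described can be reproduced outside Excel. The following is a minimal Python sketch, not part of the text's Excel workflow; the function names are hypothetical and chosen only for illustration. It computes the standard error of the proportion without the finite population correction and tests the two conditions np ≥ 5 and n(1 − p) ≥ 5 for the EAI values p = .60 and n = 30.

```python
from math import sqrt


def standard_error_of_proportion(p, n):
    """Standard error of the sample proportion, ignoring the
    finite population correction factor."""
    return sqrt(p * (1 - p) / n)


def normal_approximation_ok(p, n, threshold=5):
    """Rule-of-thumb check: np >= 5 and n(1 - p) >= 5."""
    return n * p >= threshold and n * (1 - p) >= threshold


# EAI example from the text: p = .60, n = 30
p, n = 0.60, 30
se = standard_error_of_proportion(p, n)

print("np =", round(n * p, 4), " n(1 - p) =", round(n * (1 - p), 4))
print("normal approximation reasonable:", normal_approximation_ok(p, n))
print("standard error:", round(se, 4))
```

With these inputs both conditions hold (18 and 12), and the standard error rounds to .0894, matching the value used in the text.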


Practical Value of the Sampling Distribution of $\bar{p}$


The practical value of the sampling distribution of $\bar{p}$ is that it can be used to provide probability
information about the difference between the sample proportion and the population
proportion. For instance, suppose that in the EAI problem the personnel director wants to
know the probability of obtaining a value of $\bar{p}$ that is within .05 of the population proportion
of EAI employees who participated in the training program. That is, what is the
probability of obtaining a sample with a sample proportion $\bar{p}$ between .55 and .65? The
darkly shaded area in Figure 7.12 shows this probability. Using the fact that the sampling
distribution of $\bar{p}$ can be approximated by a normal probability distribution with a mean of
.60 and a standard error of $\sigma_{\bar{p}} = .0894$, we can use Excel’s NORM.DIST function to make
this calculation. Entering the formula =NORM.DIST(.65,.60,.0894,TRUE) in a cell of an
Excel worksheet provides the cumulative probability corresponding to $\bar{p} = .65$. The value
calculated by Excel is .7120. Entering the formula =NORM.DIST(.55,.60,.0894,TRUE) in
a cell of an Excel worksheet provides the cumulative probability corresponding to $\bar{p} = .55$.
The value calculated by Excel is .2880. Thus, the probability of $\bar{p}$ being in the interval
from .55 to .65 is given by .7120 − .2880 = .4240.


FIGURE 7.11 Sampling Distribution of $\bar{p}$ for the Proportion of EAI Employees Who Participated in the Management Training Program


FIGURE 7.12 Probability of Obtaining $\bar{p}$ Between .55 and .65
[Normal curve centered at .60 with $\sigma_{\bar{p}} = .0894$: $P(\bar{p} \le .55) = .2880$, $P(\bar{p} \le .65) = .7120$, $P(.55 \le \bar{p} \le .65) = .7120 - .2880 = .4240$]

If we consider increasing the sample size to n = 100, the standard error of the propor- 
tion becomes 

$$\sigma_{\bar{p}} = \sqrt{\frac{.60(1-.60)}{100}} = .0490$$


With a sample size of 100 EAI employees, the probability of the sample proportion 
having a value within .05 of the population proportion can now be computed. Because 
the sampling distribution is approximately normal, with mean .60 and standard deviation
.0490, we can use Excel’s NORM.DIST function to make this calculation. Entering the
formula =NORM.DIST(.65,.60,.0490,TRUE) in a cell of an Excel worksheet provides the
cumulative probability corresponding to $\bar{p} = .65$. The value calculated by Excel is .8462.
Entering the formula =NORM.DIST(.55,.60,.0490,TRUE) in a cell of an Excel worksheet
provides the cumulative probability corresponding to $\bar{p} = .55$. The value calculated
by Excel is .1538. Thus, the probability of $\bar{p}$ being in the interval from .55 to .65 is given
by .8462 − .1538 = .6924. Increasing the sample size from 30 to 100 increases the probability that the
sampling error will be less than or equal to .05 by .2684 (from .4240 to .6924).
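
For readers who want to check these figures without a spreadsheet, here is a minimal Python sketch that uses the standard library's `statistics.NormalDist` in place of Excel's NORM.DIST. The helper name `prob_within` is hypothetical. The standard error is rounded to four decimals to mirror the text's use of .0894 and .0490; the results agree with .4240 and .6924 to about three decimal places, with small differences in the fourth decimal because the text subtracts cumulative probabilities that were already rounded.

```python
from math import sqrt
from statistics import NormalDist


def prob_within(p, n, margin):
    """Approximate probability that the sample proportion falls within
    +/- margin of the population proportion p, using the normal
    approximation to the sampling distribution."""
    # Round the standard error to four decimals, as the text does.
    se = round(sqrt(p * (1 - p) / n), 4)
    dist = NormalDist(mu=p, sigma=se)
    # cdf(x) plays the role of =NORM.DIST(x, mean, sd, TRUE)
    return dist.cdf(p + margin) - dist.cdf(p - margin)


prob_30 = prob_within(0.60, 30, 0.05)
prob_100 = prob_within(0.60, 100, 0.05)

print(f"n = 30:  {prob_30:.4f}")
print(f"n = 100: {prob_100:.4f}")
print(f"gain:    {prob_100 - prob_30:.4f}")
```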




Methods 

27. A random sample of size 100 is selected from a population with p = .40. 
a. What is the expected value of $\bar{p}$?

b. What is the standard error of $\bar{p}$?
c. Show the sampling distribution of $\bar{p}$.
d. What does the sampling distribution of $\bar{p}$ show?

28. A population proportion is .40. A random sample of size 200 will be taken and the
sample proportion $\bar{p}$ will be used to estimate the population proportion.

a. What is the probability that the sample proportion will be within ±.03 of the population
proportion?

b. What is the probability that the sample proportion will be within ±.05 of the population
proportion?

29. Assume that the population proportion is .55. Compute the standard error of the proportion,
$\sigma_{\bar{p}}$, for sample sizes of 100, 200, 500, and 1000. What can you say about the
size of the standard error of the proportion as the sample size is increased?

30. The population proportion is .30. What is the probability that a sample proportion will
be within ±.04 of the population proportion for each of the following sample sizes?


a. n = 100 

b. n = 200 

c. n = 500 

d. n = 1000 

e. What is the advantage of a larger sample size? 


Applications 

31. Orders from First-Time Customers. The president of Doerman Distributors, Inc. be- 
lieves that 30% of the firm’s orders come from first-time customers. A random sample 
of 100 orders will be used to estimate the proportion of first-time customers. 

a. Assume that the president is correct and p = .30. What is the sampling distribution
of $\bar{p}$ for this study?

b. What is the probability that the sample proportion $\bar{p}$ will be between .20 and .40?

c. What is the probability that the sample proportion will be between .25 and .35? 

32. Ages of Entrepreneurs. The Wall Street Journal reported that the age at first startup 
for 55% of entrepreneurs was 29 years or less and the age at first startup for 45% of 
entrepreneurs was 30 years or more. 

a. Suppose a sample of 200 entrepreneurs will be taken to learn about the most important
qualities of entrepreneurs. Show the sampling distribution of $\bar{p}$ where $\bar{p}$ is the
sample proportion of entrepreneurs whose first startup was at 29 years of age or less.

b. What is the probability that the sample proportion in part (a) will be within ±.05 of
its population proportion?

c. Suppose a sample of 200 entrepreneurs will be taken to learn about the most important
qualities of entrepreneurs. Show the sampling distribution of $\bar{p}$ where $\bar{p}$ is now the
sample proportion of entrepreneurs whose first startup was at 30 years of age or more.

d. What is the probability that the sample proportion in part (c) will be within ±.05 of
its population proportion?

e. Is the probability different in parts (b) and (d)? Why? 

f. Answer part (b) for a sample of size 400. Is the probability smaller? Why? 

33. Food Waste. In 2017, the Restaurant Hospitality website reported that only 10% 
of surplus food is being recovered in the food-service and restaurant sector, leaving 
approximately 1.5 billion meals per year uneaten. Assume this is the true population 
proportion and that you plan to take a sample survey of 525 companies in the food 
service and restaurant sector to further investigate their behavior. 

a. Show the sampling distribution of $\bar{p}$, the proportion of food recovered by your
sample respondents.

b. What is the probability that your survey will provide a sample proportion within
±.03 of the population proportion?

c. What is the probability that your survey will provide a sample proportion within
±.015 of the population proportion?

34. Unnecessary Medical Care. According to Reader’s Digest, 42% of primary care 
doctors think their patients receive unnecessary medical care. 

a. Suppose a sample of 300 primary care doctors was taken. Show the sampling 
distribution of the proportion of the doctors who think their patients receive 
unnecessary medical care. 

b. What is the probability that the sample proportion will be within ±.03 of the
population proportion?

c. What is the probability that the sample proportion will be within ±.05 of the population
proportion?

d. What would be the effect of taking a larger sample on the probabilities in parts (b) 
and (c)? Why? 

35. Better Business Bureau Complaints. In 2016 the Better Business Bureau settled 
80% of complaints they received in the United States. Suppose you have been hired 
by the Better Business Bureau to investigate the complaints they received this year 
involving new car dealers. You plan to select a sample of new car dealer complaints 
to estimate the proportion of complaints the Better Business Bureau is able to settle. 
Assume the population proportion of complaints settled for new car dealers is .80, the 
same as the overall proportion of complaints settled in 2016. 

a. Suppose you select a sample of 200 complaints involving new car dealers.
Show the sampling distribution of $\bar{p}$.

b. Based upon a sample of 200 complaints, what is the probability that the sample
proportion will be within .04 of the population proportion?

c. Suppose you select a sample of 450 complaints involving new car dealers. Show
the sampling distribution of $\bar{p}$.

d. Based upon the larger sample of 450 complaints, what is the probability that
the sample proportion will be within .04 of the population proportion?

e. As measured by the increase in probability, how much do you gain in precision by 
taking the larger sample in part (d)? 

36. Product Labeling. The Grocery Manufacturers of America reported that 76% of con- 
sumers read the ingredients listed on a product’s label. Assume the population propor- 
tion is p = .76 and a sample of 400 consumers is selected from the population. 

a. Show the sampling distribution of the sample proportion $\bar{p}$, where $\bar{p}$ is the proportion
of the sampled consumers who read the ingredients listed on a product’s label.

b. What is the probability that the sample proportion will be within ±.03 of the population
proportion?

c. Answer part (b) for a sample of 750 consumers. 

37. Household Grocery Expenditures. The Food Marketing Institute shows that 17% 
of households spend more than $100 per week on groceries. Assume the population 
proportion is p = .17 and a simple random sample of 800 households will be selected 
from the population. 
a. Show the sampling distribution of $\bar{p}$, the sample proportion of households spending
more than $100 per week on groceries.

b. What is the probability that the sample proportion will be within ±.02 of the population
proportion?

c. Answer part (b) for a sample of 1600 households. 


7.7 Other Sampling Methods 


We described simple random sampling as a procedure for sampling from a finite population
and discussed the properties of the sampling distributions of $\bar{x}$ and $\bar{p}$ when simple random
sampling is used. Other methods such as stratified random sampling, cluster sampling, and
systematic sampling provide advantages over simple random sampling in some
situations. In this section we briefly introduce these alternative sampling methods.

Margin note: This section provides a brief introduction to survey sampling methods other than simple random sampling.


Stratified Random Sampling 


In stratified random sampling, the elements in the population are first divided into groups
called strata, such that each element in the population belongs to one and only one stratum.
The basis for forming the strata, such as department, location, age, and industry type, is at
the discretion of the designer of the sample. However, the best results are obtained when
the elements within each stratum are as much alike as possible. Figure 7.13 is a diagram of
a population divided into H strata.

Margin note: Stratified random sampling works best when the variance among elements in each stratum is relatively small.

After the strata are formed, a simple random sample is taken from each stratum. For- 
mulas are available for combining the results for the individual stratum samples into one 
estimate of the population parameter of interest. The value of stratified random sampling 
depends on how homogeneous the elements are within the strata. If elements within strata 
are alike, the strata will have low variances. Thus relatively small sample sizes can be 
used to obtain good estimates of the strata characteristics. If strata are homogeneous, the 
stratified random sampling procedure provides results just as precise as those of simple 
random sampling by using a smaller total sample size. 
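
To make the procedure concrete, the following is a minimal Python sketch of stratified random sampling. The text does not specify how many elements to draw from each stratum, so the sketch assumes proportional allocation (the same sampling fraction in every stratum); the function name, the department strata, and the element labels are all hypothetical.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample from each stratum.

    strata   -- dict mapping stratum name to a list of its elements
    fraction -- sampling fraction applied to every stratum
                (proportional allocation, an assumption of this sketch)
    """
    rng = random.Random(seed)
    sample = {}
    for name, elements in strata.items():
        k = max(1, round(len(elements) * fraction))
        # random.sample draws k distinct elements without replacement
        sample[name] = rng.sample(elements, k)
    return sample


# Hypothetical population of 350 employees split into three strata
strata = {
    "Department A": [f"A{i}" for i in range(200)],
    "Department B": [f"B{i}" for i in range(100)],
    "Department C": [f"C{i}" for i in range(50)],
}

sample = stratified_sample(strata, fraction=0.10, seed=1)
for name, chosen in sample.items():
    print(name, len(chosen))
```

Each element appears in exactly one stratum, so the combined sample contains no duplicates, and every stratum is guaranteed to be represented.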


Cluster Sampling 

In cluster sampling, the elements in the population are first divided into separate groups
called clusters. Each element of the population belongs to one and only one cluster (see
Figure 7.14). A simple random sample of the clusters is then taken. All elements within
each sampled cluster form the sample. Cluster sampling tends to provide the best results
when the elements within the clusters are not alike. In the ideal case, each cluster is a
representative small-scale version of the entire population. The value of cluster sampling
depends on how representative each cluster is of the entire population. If all clusters are
alike in this regard, sampling a small number of clusters will provide good estimates of the
population parameters.

Margin note: Cluster sampling works best when each cluster provides a small-scale representation of the population.


FIGURE 7.13 Diagram for Stratified Random Sampling
[Population divided into Stratum 1, Stratum 2, ..., Stratum H]


FIGURE 7.14 Diagram for Cluster Sampling
[Population divided into Cluster 1, Cluster 2, ..., Cluster K]

One of the primary applications of cluster sampling is area sampling, where clusters are 
city blocks or other well-defined areas. Cluster sampling generally requires a larger total 
sample size than either simple random sampling or stratified random sampling. However, 
it can result in cost savings because when an interviewer is sent to a sampled cluster (e.g., 
a city-block location), many sample observations can be obtained in a relatively short time. 
Hence, a larger sample size may be obtainable with a significantly lower total cost. 
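
As an illustration of area sampling, here is a minimal Python sketch of cluster sampling: a simple random sample of clusters is drawn, and every element in each selected cluster enters the sample. The city blocks, household labels, and function name are hypothetical and serve only to show the mechanics.

```python
import random


def cluster_sample(clusters, num_clusters, seed=None):
    """Select clusters at random, then keep ALL elements in them.

    clusters     -- dict mapping cluster name to a list of its elements
    num_clusters -- number of clusters to select
    """
    rng = random.Random(seed)
    # Simple random sample of cluster names (sorted for reproducibility)
    chosen = rng.sample(sorted(clusters), num_clusters)
    sample = [element for name in chosen for element in clusters[name]]
    return chosen, sample


# Hypothetical city with 20 blocks of 15 households each
clusters = {
    f"Block {b:02d}": [f"Block {b:02d} / Household {h}" for h in range(15)]
    for b in range(20)
}

chosen_blocks, households = cluster_sample(clusters, num_clusters=4, seed=7)
print(chosen_blocks)
print(len(households), "households in the sample")
```

Only four locations need to be visited to obtain 60 observations, which is the source of the cost savings described above.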


Systematic Sampling 


In some sampling situations, especially those with large populations, it is time-consuming 
to select a simple random sample by first finding a random number and then counting or 
searching through the list of the population until the corresponding element is found. An 
alternative to simple random sampling is systematic sampling. For example, if a sample 
size of 50 is desired from a population containing 5000 elements, we will sample one 
element for every 5000/50 = 100 elements in the population. A systematic sample for this 
case involves selecting randomly one of the first 100 elements from the population list. 
Other sample elements are identified by starting with the first sampled element and then se- 
lecting every 100th element that follows in the population list. In effect, the sample of 50 is 
identified by moving systematically through the population and identifying every 100th 
element after the first randomly selected element. The sample of 50 usually will be easier 
to identify in this way than it would be if simple random sampling were used. Because the 
first element selected is a random choice, a systematic sample is usually assumed to have 
the properties of a simple random sample. This assumption is especially applicable when 
the list of elements in the population is a random ordering of the elements. 
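
The selection rule in this example can be sketched in a few lines of Python. This is an illustrative sketch only; the function name is hypothetical, and it assumes the population size is an exact multiple of the sample size, as in the text's case of 5000 elements and a sample of 50.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick a random start among the first `step` elements, then take
    every `step`-th element that follows in the population list."""
    rng = random.Random(seed)
    step = len(population) // sample_size   # 5000 // 50 = 100
    start = rng.randrange(step)             # random index in 0..step-1
    return population[start::step][:sample_size]


# Population list of 5000 elements, labeled 1 through 5000
population = list(range(1, 5001))
sample = systematic_sample(population, sample_size=50, seed=3)

print(len(sample))
print(sample[:5])
```

Only one random number is needed, which is why a systematic sample is usually easier to identify than a simple random sample of the same size.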


Convenience Sampling 


The sampling methods discussed thus far are referred to as probability sampling 
techniques. Elements selected from the population have a known probability of being 
included in the sample. The advantage of probability sampling is that the sampling dis- 
tribution of the appropriate sample statistic generally can be identified. Formulas such as 
the ones for simple random sampling presented in this chapter can be used to determine 
the properties of the sampling distribution. Then the sampling distribution can be used to 
make probability statements about the error associated with using the sample results to 
make inferences about the population. 

Convenience sampling is a nonprobability sampling technique. As the name implies, 
the sample is identified primarily by convenience. Elements are included in the sample 
without prespecified or known probabilities of being selected. For example, a professor 
conducting research at a university may use student volunteers to constitute a sample 
simply because they are readily available and will participate as subjects for little or no 
cost. Similarly, an inspector may sample a shipment of oranges by selecting oranges 
haphazardly from among several crates. Labeling each orange and using a probability
method of sampling would be impractical. Samples such as wildlife captures and volunteer 
panels for consumer research are also convenience samples. 

Convenience samples have the advantage of relatively easy sample selection and data 
collection; however, it is impossible to evaluate the “goodness” of the sample in terms of 
its representativeness of the population. A convenience sample may provide good results or 
it may not; no statistically justified procedure allows a probability analysis and inference 
about the quality of the sample results. Sometimes researchers apply statistical methods 
designed for probability samples to a convenience sample, arguing that the convenience 
sample can be treated as though it were a probability sample. However, this argument 
cannot be supported, and we should be cautious in interpreting the results of convenience 
samples that are used to make inferences about populations. 


Judgment Sampling 


Judgment sampling is another nonprobability sampling technique. In this approach, the 
person most knowledgeable on the subject of the study selects elements of the popula- 
tion that he or she feels are most representative of the population. Often this method is a 
relatively easy way of selecting a sample. For example, a reporter may sample two or three 
senators, judging that those senators reflect the general opinion of all senators. However, 
the quality of the sample results depends on the judgment of the person selecting the 
sample. Again, great caution is warranted in drawing conclusions based on judgment sam- 
ples used to make inferences about populations. 


NOTE + COMMENT 


We recommend using probability sampling methods when sampling from finite populations: simple random
sampling, stratified random sampling, cluster sampling, or
systematic sampling. For these methods, formulas are available
for evaluating the “goodness” of the sample results in
terms of the closeness of the results to the population parameters
being estimated. An evaluation of the goodness cannot
be made with convenience or judgment sampling. Thus,
great care should be used in interpreting the results based on
nonprobability sampling methods.


7.8 Practical Advice: Big Data and Errors in Sampling 


The purpose of collecting a sample is to make inferences and answer research questions 
about a population. Therefore, it is important that the sample look like, or be representative 
of, the population being investigated. In practice, individual samples always, to varying 
degrees, fail to be perfectly representative of the populations from which they have been 
taken. There are two general reasons a sample may fail to be representative of the popula- 
tion of interest: sampling error and nonsampling error. 


Sampling Error 


One reason a sample may fail to represent the population from which it has been taken is 
sampling error, or deviation of the sample from the population that results from random 
sampling. If repeated independent random samples of the same size are collected from 
the population of interest using a probability sampling technique, on average the samples 
will be representative of the population. This is the justification for collecting sample data 
randomly. However, the random collection of sample data does not ensure that any single
sample will be perfectly representative of the population of interest; when collecting a
sample randomly, the data in the sample cannot be expected to be perfectly representative
of the population from which it has been taken. Sampling error is unavoidable when
collecting a random sample; this is a risk we must accept when we choose to collect a random
sample rather than incur the costs associated with taking a census of the population.

As expressed by equations (7.2) and (7.5), the standard errors of the sampling distributions
of the sample mean $\bar{x}$ and the sample proportion $\bar{p}$ reflect the potential for sampling
error when using sample data to estimate the population mean $\mu$ and the population proportion
p, respectively. As the sample size n increases, the potential impact of extreme values
on the statistic decreases, so there is less variation in the potential values of the statistic
produced by the sample and the standard errors of these sampling distributions decrease.
Because these standard errors reflect the potential for sampling error when using sample
data to estimate the population mean $\mu$ and the population proportion p, we see that for an
extremely large sample there may be little potential for sampling error.


Nonsampling Error 


Although the standard error of a sampling distribution decreases as the sample size n 
increases, this does not mean that we can conclude that an extremely large sample will al- 
ways provide reliable information about the population of interest; this is because sampling 
error is not the sole reason a sample may fail to represent the target population. Deviations 
of the sample from the population that occur for reasons other than random sampling are 
referred to as nonsampling error. Nonsampling error can occur for a variety of reasons. 

Consider the online news service PenningtonDailyTimes.com (PDT). Because PDT’s 
primary source of revenue is the sale of advertising, the news service is intent on collect- 
ing sample data on the behavior of visitors to its website in order to support its advertising 
sales. Prospective advertisers are willing to pay a premium to advertise on websites that 
have long visit times, so PDT’s management is keenly interested in the amount of time 
customers spend during their visits to PDT’s website. Advertisers are also concerned with 
how frequently visitors to a website click on any of the ads featured on the website, so 
PDT is also interested in whether visitors to its website clicked on any of the ads featured 
on PenningtonDailyTimes.com. 

From whom should PDT collect its data? Should it collect data on current visits to 
PenningtonDailyTimes.com? Should it attempt to attract new visitors and collect data on 
these visits? If so, should it measure the time spent at its website by visitors it has attracted 
from competitors’ websites or from visitors who do not routinely visit online news sites? 
The answers to these questions depend on PDT’s research objectives. Is the company at- 
tempting to evaluate its current market, assess the potential of customers it can attract from 
competitors, or explore the potential of an entirely new market such as individuals who do 
not routinely obtain their news from online news services? If the research objective and the 


population from which the sample is to be drawn are not aligned, the data that PDT collects
will not help the company accomplish its research objective. This type of error is referred
to as a coverage error.

Margin note: Nonsampling error can occur in a sample or a census.


Even when the sample is taken from the appropriate population, nonsampling error can 
occur when segments of the target population are systematically underrepresented or overrep- 
resented in the sample. This may occur because the study design is flawed or because some 
segments of the population are either more likely or less likely to respond. Suppose PDT im- 
plements a pop-up questionnaire that opens when a visitor leaves PenningtonDailyTimes.com. 
Visitors to PenningtonDailyTimes.com who have installed pop-up blockers will likely be underrepresented,
and visitors to PenningtonDailyTimes.com who have not installed pop-up blockers
will likely be overrepresented. If the behavior of PenningtonDailyTimes.com visitors who have 
installed pop-up blockers differs from the behaviors of PenningtonDailyTimes.com visitors who 
have not installed pop-up blockers, attempting to draw conclusions from this sample about how 
all visitors to the PDT website behave may be misleading. This type of error is referred to as a 
nonresponse error. 
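
A small calculation shows why this kind of error does not shrink as the sample grows. The sketch below is purely illustrative: the population shares, response rates, and mean visit times are hypothetical numbers chosen for the example, not PDT data, and the function names are assumptions. It compares the true population mean with the mean that the responding visitors would produce on average.

```python
def population_mean(groups):
    """True mean across all visitors.
    groups -- list of (population_share, response_rate, mean_value)"""
    return sum(share * mean for share, _, mean in groups)


def expected_respondent_mean(groups):
    """Mean among those who actually respond, on average.
    Each group is weighted by share * response_rate."""
    responding = sum(share * rate for share, rate, _ in groups)
    weighted = sum(share * rate * mean for share, rate, mean in groups)
    return weighted / responding


# Hypothetical figures (assumptions for illustration only):
#   40% of visitors use pop-up blockers, 10% of them ever see the
#   questionnaire, and their mean visit time is 90 seconds;
#   60% do not use blockers, all see it, mean visit time 60 seconds.
groups = [
    (0.40, 0.10, 90.0),
    (0.60, 1.00, 60.0),
]

true_mean = population_mean(groups)
respondent_mean = expected_respondent_mean(groups)
bias = respondent_mean - true_mean

print(f"population mean:  {true_mean:.3f} seconds")
print(f"respondent mean:  {respondent_mean:.3f} seconds")
print(f"bias:             {bias:.3f} seconds")
```

Under these assumed figures the respondents understate the mean visit time by roughly 10 seconds. The sample size does not appear anywhere in the calculation, so collecting more responses through the same pop-up would not remove the gap.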

Another potential source of nonsampling error is incorrect measurement of the charac- 
teristic of interest. If PDT asks questions that are ambiguous or difficult for respondents 
to understand, the responses may not accurately reflect how the respondents intended to 
respond. For example, respondents may be unsure how to respond if PDT asks “Are the
news stories on PenningtonDailyTimes.com compelling and accurate?” How should
a visitor respond if she or he feels the news stories on PenningtonDailyTimes.com are
compelling but erroneous? What response is appropriate if the respondent feels the news
stories on PenningtonDailyTimes.com are accurate but dull? A similar issue can arise if
a question is asked in a biased or leading way. If PDT asks “Many readers find the news
stories on PenningtonDailyTimes.com to be compelling and accurate. Do you find the news
stories on PenningtonDailyTimes.com to be compelling and accurate?”, the qualifying
statement PDT makes prior to the actual question will likely result in a bias toward positive
responses. Incorrect measurement of the characteristic of interest can also occur when
respondents provide incorrect answers; this may be due to a respondent’s poor recall or unwillingness
to respond honestly. This type of error is referred to as a measurement error.

Margin note: Errors that are introduced by interviewers or during the recording and preparation of the data are other types of nonsampling error. These types of error are referred to as interviewer errors and processing errors, respectively.


Nonsampling error can introduce bias into the estimates produced using the sample, and 
this bias can mislead decision makers who use the sample data in their decision-making 
processes. No matter how small or large the sample, we must contend with this limitation 
of sampling whenever we use sample data to gain insight into a population of interest. 
Although sampling error decreases as the size of the sample increases, an extremely large 
sample can still suffer from nonsampling error and fail to be representative of the popula- 
tion of interest. When sampling, care must be taken to ensure that we minimize the intro- 
duction of nonsampling error into the data collection process. This can be done by carrying 
out the following steps: 


• Carefully define the target population before collecting sample data, and subsequently
design the data collection procedure so that a probability sample is drawn
from this target population.

• Carefully design the data collection process and train the data collectors.

• Pretest the data collection procedure to identify and correct for potential sources of
nonsampling error prior to final data collection.

• Use stratified random sampling when population-level information about an important
qualitative variable is available to ensure that the sample is representative of the
population with respect to that qualitative characteristic.

• Use cluster sampling when the population can be divided into heterogeneous subgroups
or clusters.

• Use systematic sampling when population-level information about an important
quantitative variable is available to ensure that the sample is representative of the
population with respect to that quantitative characteristic.


Finally, recognize that every random sample (even an extremely large random sample) 
will suffer from some degree of sampling error, and eliminating all potential sources of 
nonsampling error may be impractical. Understanding these limitations of sampling will 
enable us to be more realistic and pragmatic when interpreting sample data and using 
sample data to draw conclusions about the target population. 


Big Data 
Recent estimates state that approximately 2.5 quintillion bytes of data are created world- 
wide each day. This represents a dramatic increase from the estimated 100 gigabytes (GB) 
of data generated worldwide per day in 1992, the 100 GB of data generated worldwide 
per hour in 1997, and the 100 GB of data generated worldwide per second in 2002. Every 
minute, there is an average of 216,000 Instagram posts, 204,000,000 emails sent, 12 hours 
of footage uploaded to YouTube, and 277,000 tweets posted on Twitter. Without question, 
the amount of data that is now generated is overwhelming, and this trend is certainly ex- 
pected to continue. 

In each of these cases the data sets that are generated are so large or complex that 
current data processing capacity and/or analytic methods are not adequate for analyzing 
the data. Thus, each is an example of big data. There are myriad other sources of big data. 

TABLE 7.6 Terminology for Describing the Size of Data Sets


Number of Bytes    Metric    Name

1000^1             kB        kilobyte
1000^2             MB        megabyte
1000^3             GB        gigabyte
1000^4             TB        terabyte
1000^5             PB        petabyte
1000^6             EB        exabyte
1000^7             ZB        zettabyte
1000^8             YB        yottabyte


Sensors and mobile devices transmit enormous amounts of data. Internet activities, digital 
processes, and social media interactions also produce vast quantities of data. 

The amount of data has increased so rapidly that our vocabulary for describing a data 
set by its size must expand. A few years ago, a petabyte of data seemed almost unimagin- 
ably large, but we now routinely describe data in terms of yottabytes. Table 7.6 summarizes 
terminology for describing the size of data sets. 
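
The units in Table 7.6 can be applied mechanically. The following Python sketch is illustrative only (the function name is hypothetical); it expresses a byte count in the largest unit from the table that does not exceed it, using the 2.5 quintillion bytes per day quoted above as an example.

```python
# Units from Table 7.6, in increasing powers of 1000
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def describe_bytes(num_bytes):
    """Return (value, unit) using the largest Table 7.6 unit
    that is not greater than num_bytes."""
    power = 0
    while power < len(UNITS) and num_bytes >= 1000 ** (power + 1):
        power += 1
    if power == 0:
        return float(num_bytes), "bytes"
    return num_bytes / 1000 ** power, UNITS[power - 1]


# 2.5 quintillion bytes = 2.5 x 10^18 bytes
daily_bytes = 2_500_000_000_000_000_000
value, unit = describe_bytes(daily_bytes)
print(value, unit)

# 100 gigabytes written out in bytes
print(describe_bytes(100 * 1000 ** 3))
```

Since 10^18 = 1000^6, the daily figure corresponds to about 2.5 exabytes.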


Understanding What Big Data Is 


The processes that generate big data can be described by four attributes or dimensions that 
are referred to as the four Vs: 


• Volume: the amount of data generated

• Variety: the diversity in types and structures of data generated

• Veracity: the reliability of the data generated

• Velocity: the speed at which the data are generated


A high degree of any of these attributes individually is sufficient to generate big data, 
and when they occur at high levels simultaneously the resulting amount of data can be 
overwhelmingly large. Technological advances and improvements in electronic (and often 
automated) data collection make it easy to collect millions, or even billions, of observations 
in a relatively short time. Businesses are collecting greater volumes of an increasing variety 
of data at a higher velocity than ever. 

To understand the challenges presented by big data, we consider its structural dimensions.
Big data can be tall data: a data set that has so many observations that traditional
statistical inference has little meaning. For example, producers of consumer goods collect 
information on the sentiment expressed in millions of social media posts each day to better 
understand consumer perceptions of their products. Such data consist of the sentiment 
expressed (the variable) in millions (or over time, even billions) of social media posts (the 
observations). Big data can also be wide data: a data set that has so many variables that
simultaneous consideration of all variables is infeasible. For example, a high-resolution 
image can comprise millions or billions of pixels. The data used by facial recognition al- 
gorithms consider each pixel in an image when comparing an image to other images in an 
attempt to find a match. Thus, these algorithms make use of the characteristics of millions 
or billions of pixels (the variables) for relatively few high-resolution images (the observa- 
tions). Of course, big data can be both tall and wide, and the resulting data set can again be 
overwhelmingly large. 

Statistics are useful tools for understanding the information embedded in a big data set,
but we must be careful when using statistics to analyze big data. It is important that we
understand the limitations of statistics when applied to big data and that we temper our
interpretations accordingly. Because tall data are the most common form of big data used in
business, we focus on this structure in the discussions throughout the remainder of this
section.

A sample of one million or more visitors might seem unrealistic, but keep in mind that
Amazon.com had over 91 million visitors in March 2016 (quantcast.com).

TABLE 7.7 Standard Error of the Sample Mean $\bar{x}$ at Various Sample Sizes

Sample Size n        Standard Error $\sigma_{\bar{x}} = \sigma/\sqrt{n}$

10                   6.32456
100                  2.00000
1,000                0.63246
10,000               0.20000
100,000              0.06325
1,000,000            0.02000
10,000,000           0.00632
100,000,000          0.00200
1,000,000,000        0.00063


Implications of Big Data for Sampling Error 


Let’s revisit the data collection problem of online news service PenningtonDailyTimes.com
(PDT). Because PDT’s primary source of revenue is the sale of advertising, PDT’s
management is interested in the amount of time customers spend during their visits to PDT’s
website. From historical data, PDT has estimated that the standard deviation of the time
spent by individual customers when they visit PDT’s website is $\sigma = 20$ seconds. Table 7.7
shows how the standard error of the sampling distribution of the sample mean time spent by
individual customers when they visit PDT’s website decreases as the sample size increases.
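
The values in Table 7.7 can be checked with a few lines of code. The following is a minimal Python sketch (the text itself works in Excel, and the function name `standard_error_mean` is an assumption made for illustration); it applies $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ with $\sigma = 20$ to each sample size in the table.

```python
import math


def standard_error_mean(sigma: float, n: int) -> float:
    """Standard error of the sample mean for an infinite population: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma = 20.0  # PDT's historical standard deviation, in seconds
sample_sizes = [10 ** k for k in range(1, 10)]  # 10, 100, ..., 1,000,000,000

# Print one row per sample size, rounded to five decimals as in Table 7.7
for n in sample_sizes:
    print(f"{n:>13,d}  {standard_error_mean(sigma, n):.5f}")
```

Each tenfold increase in n divides the standard error by $\sqrt{10}$, which is why every second row of the table drops by exactly a factor of 10.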

PDT also wants to collect information from its sample respondents on whether a 
visitor to its website clicked on any of the ads featured on the website. From its historical 
data, PDT knows that 51% of past visitors to its website clicked on any ad featured on 
the website, so it will use this value as p to estimate the standard error. Table 7.8 shows 
how the standard error of the sampling distribution of the proportion of the sample that 
clicked on any of the ads featured on PenningtonDailyTimes.com decreases as the sample 
size increases. 

TABLE 7.8 Standard Error of the Sample Proportion $\bar{p}$ at Various Sample Sizes

Sample Size n        Standard Error $\sigma_{\bar{p}} = \sqrt{p(1 - p)/n}$

10                   .15811
100                  .05000
1,000                .01581
10,000               .00500
100,000              .00158
1,000,000            .00050
10,000,000           .00016
100,000,000          .00005
1,000,000,000        .00002

The PDT example illustrates the general relationship between standard errors and the
sample size. We see in Table 7.7 that the standard error of the sample mean decreases as
the sample size increases. For a sample of n = 10, the standard error of the sample mean
is 6.32456; when we increase the sample size to n = 100,000, the standard error of the
sample mean decreases to .06325; and at a sample size of n = 1,000,000,000, the standard
error of the sample mean decreases to only .00063. In Table 7.8 we see that the standard
error of the sample proportion also decreases as the sample size increases. For a sample of
n = 10, the standard error of the sample proportion is .15811; when we increase the sample
size to n = 100,000, the standard error of the sample proportion decreases to .00158; and
at a sample size of n = 1,000,000,000, the standard error of the sample proportion decreases
to only .00002. In both Table 7.7 and Table 7.8, the standard error when n = 1,000,000,000
is one ten-thousandth of the standard error when n = 10.
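
The one ten-thousandth relationship does not depend on $\sigma$ or p: both standard errors are proportional to $1/\sqrt{n}$, so the ratio of the standard errors at two sample sizes is $\sqrt{n_1/n_2}$. The short Python sketch below illustrates this with the PDT values; the function names are assumptions chosen for illustration and are not part of the text.

```python
import math


def se_mean(sigma: float, n: int) -> float:
    """Standard error of the sample mean (infinite population)."""
    return sigma / math.sqrt(n)


def se_proportion(p: float, n: int) -> float:
    """Standard error of the sample proportion (infinite population)."""
    return math.sqrt(p * (1 - p) / n)


small_n, large_n = 10, 1_000_000_000

# Ratio of the standard error at the large sample size to that at the small one
ratio_mean = se_mean(20.0, large_n) / se_mean(20.0, small_n)
ratio_prop = se_proportion(0.51, large_n) / se_proportion(0.51, small_n)

# Both ratios reduce to sqrt(small_n / large_n), whatever sigma or p is
expected = math.sqrt(small_n / large_n)
print(ratio_mean, ratio_prop, expected)
```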

Suppose that last year the mean time spent by all visitors to the PDT website was
84 seconds. Further suppose that the mean time has not changed since last year and that
PDT now collects a sample of 1,000,000 visitors to its website. With n = 1,000,000 and
the assumed known value of $\sigma = 20$ seconds for the population standard deviation, the
standard error will be $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 20/\sqrt{1{,}000{,}000} = .02$. Note that because of the
very large sample size, the sampling distribution of the sample mean $\bar{x}$ will be
approximately normally distributed. PDT can use this information to calculate the
probability the sample mean $\bar{x}$ will fall within .15 of the population mean, or within the
interval $84 \pm .15$. At $\bar{x} = 83.85$,

$$z = \frac{83.85 - 84}{.02} = -7.5$$

and at $\bar{x} = 84.15$,

$$z = \frac{84.15 - 84}{.02} = 7.5$$

Because $P(z \le -7.5) = .0000$ and $P(z \le 7.5) = 1.0000$, the probability the sample mean $\bar{x}$
will fall within .15 of the population mean $\mu$ is

$$P(-7.5 \le z \le 7.5) = 1.0000 - .0000 = 1.0000$$

Using Excel we find

NORM.DIST(84.15,84,.02,TRUE) - NORM.DIST(83.85,84,.02,TRUE) = 1.000
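
For readers who want to check the Excel result outside a spreadsheet, the same calculation can be sketched in Python. This is an illustration only: the helper `normal_cdf` is a hypothetical name, written here with the standard library's `math.erf` to play the role of Excel's cumulative NORM.DIST.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mu = 84.0          # assumed population mean, in seconds
sigma = 20.0       # assumed population standard deviation, in seconds
n = 1_000_000      # sample size
se = sigma / math.sqrt(n)  # standard error of the sample mean

# z values at the two ends of the interval 84 +/- .15
z_low = (83.85 - mu) / se
z_high = (84.15 - mu) / se

# Probability that the sample mean falls between 83.85 and 84.15
prob = normal_cdf(84.15, mu, se) - normal_cdf(83.85, mu, se)
print(round(z_low, 4), round(z_high, 4), round(prob, 4))
```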


Now, suppose that for the new sample of 1,000,000 visitors the sample mean time spent
by all visitors to the PDT website is 88 seconds. What could PDT conclude? Its
calculations show that the probability the sample mean will be between 83.85 and 84.15 is
approximately 1.0000, and yet the mean of PDT’s sample is 88. There are three possible
reasons that PDT’s sample mean differs so substantially from the population mean for
last year:

• sampling error
• nonsampling error
• the population mean has changed since last year

Because the sample size is extremely large, the sample should have very little sampling
error, and so sampling error cannot explain the substantial difference between PDT’s
sample mean of $\bar{x} = 88$ seconds and the population mean of $\mu = 84$ seconds for last
year. Nonsampling error is a possible explanation and should be investigated. If PDT
determines that it introduced little or no nonsampling error into its sample data, the only
remaining plausible explanation for the substantial difference between PDT’s sample
mean and the population mean for last year is that the population mean has changed
since last year. If the sample was collected properly, it provides evidence of a potentially
important change in behavior of visitors to the PDT website that could have tremendous
implications for PDT.
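
To see why sampling error is ruled out here, it helps to express the 4-second difference in units of the standard error. The Python sketch below does this for PDT's sample of 1,000,000 and, purely as a hypothetical contrast that does not appear in the text, for a sample of only 100 visitors. The function name `z_score` and the n = 100 comparison are assumptions made for illustration.

```python
import math


def z_score(sample_mean: float, mu: float, sigma: float, n: int) -> float:
    """Number of standard errors between the sample mean and the population mean."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


mu, sigma, sample_mean = 84.0, 20.0, 88.0

z_big = z_score(sample_mean, mu, sigma, 1_000_000)  # PDT's actual sample size
z_small = z_score(sample_mean, mu, sigma, 100)      # hypothetical small sample

# Two-sided tail probability P(|Z| >= z) under the standard normal distribution
tail_big = math.erfc(z_big / math.sqrt(2.0))
tail_small = math.erfc(z_small / math.sqrt(2.0))

print(z_big, tail_big)      # hundreds of standard errors away
print(z_small, tail_small)  # only a couple of standard errors away
```

With the small hypothetical sample, a 4-second difference is about two standard errors and could plausibly be sampling error; with one million observations the same difference is far too many standard errors away to be attributed to chance.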

No matter how small or large the sample, we must contend with the limitations
of sampling whenever we use sample data to learn about a population of interest.
Although sampling error decreases as the size of the sample increases, an extremely
large sample can indeed suffer from failure to be representative of the population of
interest because of nonsampling error.

Finally, recognize that every random sample (even an extremely large sample) will
suffer from some degree of sampling error, and eliminating all potential sources of
nonsampling error may be impractical. Understanding these limitations of sampling will
enable us to be more realistic when interpreting sample data and using sample data to draw
conclusions about the target population. In the next several chapters we will explore
statistical methods for dealing with these issues in greater detail.


NOTES + COMMENTS

1. In the previous section of this chapter, we explained that one reason probability
sampling methods are generally preferred over nonprobability sampling methods is that
formulas are available for evaluating the “goodness” of the sample results in terms of
the closeness of the results to the population parameters being estimated for sample
data collected using probability sampling methods. This evaluation cannot be made with
convenience or judgment sampling. Another reason for the general preference of
probability sampling methods over nonprobability sampling methods is that probability
sampling methods are less likely than nonprobability sampling methods to introduce
nonsampling error. Although nonsampling error can occur when either a probability
sampling method or a nonprobability sampling method is used, nonprobability sampling
methods such as convenience sampling and judgment sampling frequently introduce
nonsampling error into the sample data. This is because of the manner in which sample
data are collected when using a nonprobability sampling method.

2. Several approaches to statistical inference (interval estimation and hypothesis
testing) are introduced in subsequent chapters of this book. These approaches assume
nonsampling error has not been introduced into the sample data. The reliability of the
results of statistical inference decreases as greater nonsampling error is introduced into
the sample data.


EXERCISES


Methods


38. A population has a mean of 400 and a standard deviation of 100. A sample of size
100,000 will be taken, and the sample mean $\bar{x}$ will be used to estimate the population
mean.

a. What is the expected value of $\bar{x}$?
b. What is the standard deviation of $\bar{x}$?
c. Show the sampling distribution of $\bar{x}$.
d. What does the sampling distribution of $\bar{x}$ show?

39. Assume the population standard deviation is $\sigma = 25$. Compute the standard error of
the mean, $\sigma_{\bar{x}}$, for sample sizes of 500,000; 1,000,000; 5,000,000; 10,000,000; and
100,000,000. What can you say about the size of the standard error of the mean as the
sample size is increased?

40. A sample of size 100,000 is selected from a population with p = .75.

a. What is the expected value of $\bar{p}$?

b. What is the standard error of $\bar{p}$?

c. Show the sampling distribution of $\bar{p}$.

d. What does the sampling distribution of $\bar{p}$ show?

41. Assume that the population proportion is .44. Compute the standard error of the
proportion, $\sigma_{\bar{p}}$, for sample sizes of 500,000; 1,000,000; 5,000,000; 10,000,000; and
100,000,000. What can you say about the size of the standard error of the sample
proportion as the sample size is increased?

Applications

42. Sampling Distribution of App Users. Martina Levitt, director of marketing for the 
messaging app Spontanversation, has been assigned the task of profiling users of this 
app. Assume that individuals who have downloaded Spontanversation use the app an 
average of 30 times per day with a standard deviation of 6. 

a. What is the sampling distribution of $\bar{x}$ if a random sample of 50 individuals who
have downloaded Spontanversation is used?

b. What is the sampling distribution of $\bar{x}$ if a random sample of 500,000 individuals
who have downloaded Spontanversation is used?

c. What general statement can you make about what happens to the sampling distribution
of $\bar{x}$ as the sample size becomes extremely large? Does this generalization
seem logical? Explain.

43. Profiling App Users. Consider the Spontanversation sampling problem for which 
individuals who have downloaded Spontanversation use the app an average of 30 times 
per day with a standard deviation of 6. 

a. What is the probability that $\bar{x}$ is within 1.7 of the population mean if a random
sample of 50 individuals who have downloaded Spontanversation is used?

b. What would you conclude if the mean of the sample collected in part (a) is 30.2?

c. What is the probability that $\bar{x}$ is within .017 of the population mean if a sample of
500,000 individuals who have downloaded Spontanversation is used?

d. What would you conclude if the mean of the sample collected in part (c) is 30.2? 

44. Vacation Hours Earned by Blue-Collar and Service Employees. The U.S. Bureau 
of Labor Statistics (BLS) reported that the mean annual number of hours of vacation 
time earned by blue-collar and service employees who work for small private estab- 
lishments and have at least 10 years of service is 100. Assume that for this population 
the standard deviation for the annual number of vacation hours earned is 48. Suppose 
the BLS would like to select a sample of 15,000 individuals from this population for a 
follow-up study. 

a. Show the sampling distribution of $\bar{x}$, the sample mean for a sample of 15,000
individuals from this population.

b. What is the probability that a simple random sample of 15,000 individuals from this
population will provide a sample mean that is within one hour of the population
mean?

c. Suppose the mean annual number of hours of vacation time earned for a sample
of 15,000 blue-collar and service employees who work for small private establishments
and have at least 10 years of service differs from the population mean $\mu$ by
more than one hour. Considering your results for part (b), how would you interpret
this result?

45. MPG for New Cars. The New York Times reported that 17.2 million new cars and
light trucks were sold in the United States in 2017, and the U.S. Environmental
Protection Agency projects the average efficiency for these vehicles to be 25.2 miles per
gallon. Assume that the population standard deviation in miles per gallon for these
automobiles is $\sigma = 6$.

a. What is the probability a sample of 70,000 new cars and light trucks sold in the
United States in 2017 will provide a sample mean miles per gallon that is within .05
miles per gallon of the population mean of 25.2?

b. What is the probability a sample of 70,000 new cars and light trucks sold in the
United States in 2017 will provide a sample mean miles per gallon that is within .01
miles per gallon of the population mean of 25.2? Compare this probability to the
value computed in part (a).

c. What is the probability a sample of 90,000 new cars and light trucks sold in the
United States in 2017 will provide a sample mean miles per gallon that is within .01
miles per gallon of the population mean of 25.2? Comment on the differences between
this probability and the value computed in part (b).

d. Suppose the mean miles per gallon for a sample of 70,000 new cars and light trucks
sold in the United States in 2017 differs from the population mean $\mu$ by more than
one mile per gallon. How would you interpret this result?

46. Genders of Entrepreneurs. The Wall Street Journal reported that 37% of all entre- 
preneurs who opened new U.S. businesses in the previous year were female. 

a. Suppose a random sample of 300 entrepreneurs who opened new U.S. businesses
in the previous year will be taken to learn about which industries are most appealing
to entrepreneurs. Show the sampling distribution of $\bar{p}$, where $\bar{p}$ is the sample
proportion of entrepreneurs who opened new U.S. businesses in the previous year
that are female.

b. What is the probability that the sample proportion in part (a) will be within ±.05 of
its population proportion?

c. Suppose a random sample of 30,000 entrepreneurs who opened new U.S. businesses
in the previous year will be taken to learn about which industries are most appealing
to entrepreneurs. Show the sampling distribution of $\bar{p}$, where $\bar{p}$ is the sample
proportion of entrepreneurs who opened new U.S. businesses last year that are female.

d. What is the probability that the sample proportion in part (c) will be within ±.05 of
its population proportion?

e. Is the probability different in parts (b) and (d)? Why? 

47. Customers’ Ages. The vice president of sales for Blasterman Cosmetics, Inc. believes 
that 40% of the company’s orders come from customers who are less than 30 years 
old. A random sample of 10,000 orders will be used to estimate the proportion of cus- 
tomers who are less than 30 years old. 

a. Assume that the vice president of sales is correct and p = .40. What is the sampling
distribution of $\bar{p}$ for this study?

b. What is the probability that the sample proportion will be between .37 and .43? 

c. What is the probability that the sample proportion will be between .39 and .41? 

d. What would you conclude if the sample proportion is .36? 

48. Repeat Purchases. The president of Colossus.com, Inc., believes that 42% of the 
firm’s orders come from customers who have purchased from Colossus.com in the 
past. A random sample of 108,700 orders from the past six months will be used to 
estimate the proportion of orders placed by repeat customers. 

a. Assume that Colossus.com’s president is correct and the population proportion
p = .42. What is the sampling distribution of $\bar{p}$ for this study?

b. What is the probability that the sample proportion $\bar{p}$ will be within .1% of the
population proportion?

c. What is the probability that the sample proportion $\bar{p}$ will be within .25% of the
population proportion? Comment on the difference between this probability and the
value computed in part (b).

d. Suppose the proportion of orders placed by repeat customers for a sample of
108,700 orders from the past six months differs from the population proportion p
by more than 1%. How would you interpret this result?

49. Landline Telephone Service. According to the U.S. Department of Health and 
Human Services, only 49.2% of homes in the United States used landline telephone 
service in 2017. 

a. Suppose a sample of 207,000 U.S. homes will be taken to learn about home telephone
usage. Show the sampling distribution of $\bar{p}$, where $\bar{p}$ is the sample proportion
of homes that use landline phone service.

b. What is the probability that the sample proportion in part (a) will be within
±.002 of the population proportion?

c. Suppose a sample of 86,800 U.S. homes will be taken to learn about home telephone
usage. Show the sampling distribution of $\bar{p}$, where $\bar{p}$ is the sample proportion
of homes that use landline phone service.

d. What is the probability that the sample proportion in part (c) will be within
±.002 of the population proportion?

e. Are the probabilities different in parts (b) and (d)? Why or why not?


SUMMARY


In this chapter we presented the concepts of sampling and sampling distributions. We
demonstrated how a simple random sample can be selected from a finite population and
how a random sample can be selected from an infinite population. The data collected from
such samples can be used to develop point estimates of population parameters. Because
different samples provide different values for the point estimators, point estimators such
as $\bar{x}$ and $\bar{p}$ are random variables. The probability distribution of such a random variable
is called a sampling distribution. In particular, we described in detail the sampling
distributions of the sample mean $\bar{x}$ and the sample proportion $\bar{p}$.

In considering the characteristics of the sampling distributions of $\bar{x}$ and $\bar{p}$, we stated
that $E(\bar{x}) = \mu$ and $E(\bar{p}) = p$. After developing the standard deviation or standard error
formulas for these estimators, we described the conditions necessary for the sampling
distributions of $\bar{x}$ and $\bar{p}$ to follow a normal distribution. Other sampling methods including
stratified random sampling, cluster sampling, systematic sampling, convenience sampling, and
judgment sampling were discussed. Finally, we discussed the concept of big data and the
ramifications of extremely large samples on the sampling distributions of the sample mean
and the sample proportion.


GLOSSARY


Big data Any set of data that is too large or too complex to be handled by standard 
data-processing techniques and typical desktop software. 

Central limit theorem A theorem that enables one to use the normal probability distribution
to approximate the sampling distribution of $\bar{x}$ whenever the sample size is large.

Cluster sampling A probability sampling method in which the population is first divided 
into clusters and then a simple random sample of the clusters is taken. 

Convenience sampling A nonprobability method of sampling whereby elements are 
selected for the sample on the basis of convenience. 

Coverage error Nonsampling error that results when the research objective and the popu- 
lation from which the sample is to be drawn are not aligned. 

Finite population correction factor The term $\sqrt{(N - n)/(N - 1)}$ that is used in the
formulas for $\sigma_{\bar{x}}$ and $\sigma_{\bar{p}}$ whenever a finite population, rather than an infinite
population, is being sampled. The generally accepted rule of thumb is to ignore the finite
population correction factor whenever $n/N \le .05$.

Frame A listing of the elements the sample will be selected from. 

Judgment sampling A nonprobability method of sampling whereby elements are selected 
for the sample based on the judgment of the person doing the study. 

Measurement error Nonsampling error that results from the incorrect or imprecise mea- 
surement of the population characteristic of interest. 

Nonresponse error Nonsampling error that results when potential respondents that belong 
to some segment(s) of the population are less likely to respond to the survey mechanism 
than potential respondents that belong to other segments of the population. 

Nonsampling error All types of errors other than sampling error, such as coverage error,
nonresponse error, measurement error, interviewer error, and processing error.

Parameter A numerical characteristic of a population, such as a population mean $\mu$, a
population standard deviation $\sigma$, or a population proportion p.

Point estimate The value of a point estimator used in a particular instance as an estimate 
of a population parameter. 

Point estimator The sample statistic, such as $\bar{x}$, s, or $\bar{p}$, that provides the point
estimate of the population parameter.

Random sample A random sample from an infinite population is a sample selected such 
that the following conditions are satisfied: (1) Each element selected comes from the same 
population; (2) each element is selected independently. 

Sample statistic A sample characteristic, such as a sample mean $\bar{x}$, a sample standard
deviation s, or a sample proportion $\bar{p}$. The value of the sample statistic is used to
estimate the value of the corresponding population parameter.

Sampled population The population from which the sample is taken. 

Sampling distribution A probability distribution consisting of all possible values of a 
sample statistic. 

Sampling error The error that occurs because a sample, and not the entire population, is 
used to estimate a population parameter. 

Simple random sample A simple random sample of size n from a finite population of 
size N is a sample selected such that each possible sample of size n has the same probabil- 
ity of being selected. 

Standard error The standard deviation of a point estimator. 

Stratified random sampling A probability sampling method in which the popula- 
tion is first divided into strata and a simple random sample is then taken from each 
stratum. 

Systematic sampling A probability sampling method in which we randomly select one of 
the first k elements and then select every kth element thereafter. 

Tall data A data set that has so many observations that traditional statistical inference has 
little meaning. 

Target population The population for which statistical inferences such as point estimates 
are made. It is important for the target population to correspond as closely as possible to 
the sampled population. 

Unbiased A property of a point estimator that is present when the expected value of the 
point estimator is equal to the population parameter it estimates. 

Variety The diversity in types and structures of the data generated. 

Velocity The speed at which the data are generated. 

Veracity The reliability of the data generated. 

Volume The amount of data generated. 

Wide data A data set that has so many variables that simultaneous consideration of all 
variables is infeasible. 


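KEY FORMULAS


Expected Value of $\bar{x}$

$$E(\bar{x}) = \mu \tag{7.1}$$

Standard Deviation of $\bar{x}$ (Standard Error)

$$\text{Finite Population: } \sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}}\left(\frac{\sigma}{\sqrt{n}}\right) \qquad \text{Infinite Population: } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{7.2}$$

Expected Value of $\bar{p}$

$$E(\bar{p}) = p \tag{7.4}$$

Standard Deviation of $\bar{p}$ (Standard Error)

$$\text{Finite Population: } \sigma_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{p(1 - p)}{n}} \qquad \text{Infinite Population: } \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \tag{7.5}$$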
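
The finite and infinite population forms of the standard error formulas differ only by the finite population correction factor $\sqrt{(N - n)/(N - 1)}$. The Python sketch below is one way to write both forms in a single function; the function names and the optional `N` argument are assumptions for illustration, and the sample values (σ = 20, n = 100, N = 1,000 or 1,000,000) are hypothetical.

```python
import math
from typing import Optional


def fpc(n: int, N: int) -> float:
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))


def se_mean(sigma: float, n: int, N: Optional[int] = None) -> float:
    """Standard error of the sample mean; pass N to apply the correction factor."""
    se = sigma / math.sqrt(n)
    return se if N is None else fpc(n, N) * se


def se_proportion(p: float, n: int, N: Optional[int] = None) -> float:
    """Standard error of the sample proportion; pass N to apply the correction factor."""
    se = math.sqrt(p * (1 - p) / n)
    return se if N is None else fpc(n, N) * se


# Hypothetical values: n/N = .10 exceeds the .05 rule of thumb, so the correction matters
print(se_mean(20.0, 100), se_mean(20.0, 100, N=1_000))

# Hypothetical values: n/N = .0001, so the correction factor is very close to 1
print(se_mean(20.0, 100, N=1_000_000))
```

When n/N is small the corrected and uncorrected standard errors are nearly identical, which is the basis of the rule of thumb for ignoring the correction factor.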


SUPPLEMENTARY EXERCISES 


50. Shadow Stocks. Jack Lawler, a financial analyst, wants to prepare an article on the
Shadow Stock portfolio developed by the American Association of Individual Investors
(AAII). A list of the 30 companies in the Shadow Stock portfolio as of March
2014 is contained in the file ShadowStocks (AAII website). Jack would like to select a
simple random sample of 5 of these companies for an interview concerning management
practices.

a. In the file ShadowStocks, the Shadow Stock companies are listed in column A of 
an Excel worksheet. In column B we have generated a random number for each of 
the companies. Use these random numbers to select a simple random sample of 5 of 
these companies for Jack. 

b. Generate a new set of random numbers and use them to select a new simple random 
sample. Did you select the same companies? 

51. Personal Health Expenditures. Data made available through the Petersen-Kaiser 
Health System Tracker in May 2018 showed health expenditures were $10,348 per 
person in the United States. Use $10,348 as the population mean and suppose a survey 
research firm will take a sample of 100 people to investigate the nature of their health 
expenditures. Assume the population standard deviation is $2500. 

a. Show the sampling distribution of the mean amount of health care expenditures for 
a sample of 100 people. 

b. What is the probability the sample mean will be within ±$200 of the population
mean?

c. What is the probability the sample mean will be greater than $12,000? If the survey 
research firm reports a sample mean greater than $12,000, would you question 
whether the firm followed correct sampling procedures? Why or why not? 

52. Foot Locker Store Productivity. According to a report in The Wall Street Journal, 
Foot Locker uses sales per square foot as a measure of store productivity. Sales are 
currently running at an annual rate of $406 per square foot. You have been asked by 
management to conduct a study of a sample of 64 Foot Locker stores. Assume the 
standard deviation in annual sales per square foot for the population of all 3400 Foot 
Locker stores is $80. 

a. Show the sampling distribution of $\bar{x}$, the sample mean annual sales per square foot
for a sample of 64 Foot Locker stores.

b. What is the probability that the sample mean will be within $15 of the population 
mean? 

c. Suppose you find a sample mean of $380. What is the probability of finding a 
sample mean of $380 or less? Would you consider such a sample to be an unusually 
low performing group of stores? 

53. Airline Fares. The mean airfare for flights departing from Buffalo Niagara
International Airport during the first three months of 2017 was $320.51. Assume
the standard deviation for this population of fares is known to be $80. Suppose a
random sample of 60 flights departing from Buffalo Niagara International Airport
during the first three months of 2018 is taken.

a. If the mean and standard deviation of the population of airfares for flights
departing from Buffalo Niagara International Airport did not change between
the first three months of 2017 and the first three months of 2018, what is the
probability the sample mean will be within $20 of the population mean cost
per flight?

b. What is the probability the sample mean will be within $10 of the population mean 
cost per flight? 

54. University Costs. After deducting grants based on need, the average cost to attend the 
University of Southern California (USC) is $27,175. Assume the population standard 
deviation is $7400. Suppose that a random sample of 60 USC students will be taken 
from this population. 

a. What is the value of the standard error of the mean? 

b. What is the probability that the sample mean will be more than $27,175? 

c. What is the probability that the sample mean will be within $1000 of the population 
mean? 

d. How would the probability in part (c) change if the sample size were increased to 
100? 

55. Inventory Costs. Three firms carry inventories that differ in size. Firm A’s inventory 
contains 2000 items, firm B’s inventory contains 5000 items, and firm C’s inventory 
contains 10,000 items. The population standard deviation for the cost of the items in 
each firm’s inventory is $\sigma = 144$. A statistical consultant recommends that each firm
take a sample of 50 items from its inventory to provide statistically valid estimates of 
the average cost per item. Employees of the small firm state that because it has the 
smallest population, it should be able to make the estimate from a much smaller sam- 
ple than that required by the larger firms. However, the consultant states that to obtain 
the same standard error and thus the same precision in the sample results, all firms 
should use the same sample size regardless of population size. 

a. Using the finite population correction factor, compute the standard error for each of 
the three firms given a sample of size 50. 

b. What is the probability that for each firm the sample mean $\bar{x}$ will be within ±25 of
the population mean $\mu$?

56. Survey Research Results. A researcher reports survey results by stating that the stan- 
dard error of the mean is 20. The population standard deviation is 500. 

a. How large was the sample used in this survey? 

b. What is the probability that the point estimate was within ±25 of the population
mean?

57. Production Quality Control. A production process is checked periodically by a
quality control inspector. The inspector selects simple random samples of 30 finished
products and computes the sample mean product weights $\bar{x}$. If test results over a long
period of time show that 5% of the $\bar{x}$ values are over 2.1 pounds and 5% are under
1.9 pounds, what are the mean and the standard deviation for the population of products
produced with this process?

58. Australians and Smoking. Fifteen percent of Australians smoke. Reuters reports that 
by introducing tough laws banning brand labels on cigarette packages, Australia hopes 
to reduce the percentage of people smoking to 10% by 2018. Answer the following 
questions based on a sample of 240 Australians. 

a. Show the sampling distribution of $\bar{p}$, the proportion of Australians who are smokers.

b. What is the probability the sample proportion will be within ±.04 of the population
proportion?

c. What is the probability the sample proportion will be within ±.02 of the population
proportion?

59. Marketing Research Telephone Surveys. A market research firm conducts telephone
surveys with a 40% historical response rate. What is the probability that in a new sample
of 400 telephone numbers, at least 150 individuals will cooperate and respond to
the questions? In other words, what is the probability that the sample proportion will
be at least 150/400 = .375?

60. Internet Advertising. Advertisers contract with Internet service providers and search 
engines to place ads on websites. They pay a fee based on the number of potential 
customers who click on their ad. Unfortunately, click fraud—the practice of some- 
one clicking on an ad solely for the purpose of driving up advertising revenue—has 
become a problem. According to BusinessWeek, 40% of advertisers claim they have 
been a victim of click fraud. Suppose a simple random sample of 380 advertisers will 
be taken to learn more about how they are affected by this practice. 

a. What is the probability that the sample proportion will be within ±.04 of the
population proportion experiencing click fraud?

b. What is the probability that the sample proportion will be greater than .45? 

61. Traffic Tickets. The proportion of individuals insured by the All-Driver Automobile
Insurance Company who received at least one traffic ticket during a five-year period is .15.

a. Show the sampling distribution of $\bar{p}$ if a random sample of 150 insured individuals
is used to estimate the proportion having received at least one ticket.

b. What is the probability that the sample proportion will be within ±.03 of the
population proportion?

62. Textbook Publishing. Lori Jeffrey is a successful sales representative for a major 
publisher of college textbooks. Historically, Lori obtains a book adoption on 25% 
of her sales calls. Viewing her sales calls for one month as a sample of all possible 
sales calls, assume that a statistical analysis of the data yields a standard error of the 
proportion of .0625. 

a. How large was the sample used in this analysis? That is, how many sales calls did 
Lori make during the month? 

b. Let p̄ indicate the sample proportion of book adoptions obtained during the month. 
Show the sampling distribution of p̄. 

c. Using the sampling distribution of p̄, compute the probability that Lori will obtain 
book adoptions on 30% or more of her sales calls during a one-month period. 

63. Life of Compact Fluorescent Lights. In 2018, the Simple Dollar website reported 
that the mean life of 14-watt compact fluorescent lights (CFLs) is 8000 hours. Assume 
that for this population the standard deviation for CFL life is 480. Suppose the U.S. 
Department of Energy would like to select a random sample of 35,000 from the popu- 
lation of 14-watt CFLs for a follow-up study. 

a. Show the sampling distribution of x̄, the sample mean for a sample of 35,000 
individuals from this population. 

b. What is the probability that a simple random sample of 35,000 individuals 
from this population will provide a sample mean that is within four hours of the 
population mean? 

c. What is the probability that a simple random sample of 35,000 individuals from 
this population will provide a sample mean that is within one hour of the popula- 
tion mean? 

d. Suppose the mean life of a sample of 35,000 14-watt CFLs differs from the popula- 
tion mean life by more than four hours. How would you interpret this result? 

64. Typical Home Internet Usage. According to USC Annenberg, the mean time 
spent by Americans on the Internet in their home per week is 17.6 hours. Assume 
that the standard deviation for the time spent by Americans on the Internet in their 
home per week is 5.1 hours. Suppose the Florida Department of State plans to 
select a random sample of 85,020 of the state’s residents for a study of Floridians’ 
Internet usage. 

a. Using the U.S. population figures provided in the problem (the population mean 
and standard deviation of time spent by Americans on the Internet in their home per 
week are 17.6 hours and 5.1 hours, respectively), what is the sampling distribution 
of the sample mean for the sample of 85,020 Floridians? 
b. Using the sampling distribution from part (a), what is the probability that a random 
sample of 85,020 Floridians will provide a sample mean that is within three min- 
utes of the population mean? 

c. Suppose the mean time spent on the Internet in their home per week by the sample 
of 85,020 Floridians differs from the U.S. population mean by more than three 
minutes. How would you interpret this result? 

65. Undeliverable Mail Pieces. Of the 155 billion mailpieces the U.S. Postal Service 
(USPS) processed and delivered in 2017, 4.3% were undeliverable as addressed. Sup- 
pose that a brief questionnaire about USPS service is attached to each mailpiece in a 
random sample of 114,250 mailpieces. 

a. What is the sampling distribution of the sample proportion of undeliverable 
mailpieces p̄ for this study? 

b. What is the probability that the sample proportion of undeliverable mailpieces p̄ 
will be within .1% of the population proportion of undeliverable mailpieces? 

c. What is the probability that the sample proportion of undeliverable mailpieces 
p̄ will be within .05% of the population proportion of undeliverable mailpieces? 
Comment on the difference between this probability and the probability computed 
in part (b). 

66. U.S. Drivers and Speeding. ABC News reports that 58% of U.S. drivers admit to 
speeding. Suppose that a new satellite technology can instantly measure the speed of 
any vehicle on a U.S. road and determine whether the vehicle is speeding, and this sat- 
ellite technology was used to take a random sample of 20,000 vehicles at 6 P.M. EST 
on a recent Tuesday afternoon. 

a. For this investigation, what is the sampling distribution for sample proportion of 
vehicles on U.S. roads that speed? 

b. What is the probability that the sample proportion of speeders p̄ will be within 1% 
of the population proportion of speeders? 

c. Suppose the sample proportion of speeders p̄ differs from the U.S. population 
proportion of speeders by more than 1%. How would you interpret this result? 


CASE PROBLEM: MARION DAIRIES 


Last year Marion Dairies decided to enter the yogurt market, and it began cautiously by 
producing, distributing, and marketing a single flavor—a blueberry-flavored yogurt that it 
calls Blugurt. The company’s initial venture into the yogurt market has been very suc- 
cessful; sales of Blugurt are higher than expected, and consumers’ ratings of the product 
have a mean of 80 and a standard deviation of 25 on a 100-point scale for which 100 is the 
most favorable score and zero is the least favorable score. Past experience has also shown 
Marion Dairies that a consumer who rates one of its products with a score greater than 
75 on this scale will consider purchasing the product, and a score of 75 or less indicates the 
consumer will not consider purchasing the product. 

Emboldened by the success and popularity of its blueberry-flavored yogurt, Marion 
Dairies management is now considering the introduction of a second flavor. Marion’s 
marketing department is pressing to extend the product line through the introduction of a 
strawberry-flavored yogurt that would be called Strawgurt, but senior managers are con- 
cerned about whether or not Strawgurt will increase Marion’s market share by appealing to 
potential customers who do not like Blugurt. That is, the goal in offering the new product 
is to increase Marion’s market share rather than cannibalize existing sales of Blugurt. The 
marketing department has proposed giving tastes of both Blugurt and Strawgurt to a simple 
random sample of 50 customers and asking each of them to rate the two flavors of yogurt 
on the 100-point scale. If the mean score given to Blugurt by this sample of consumers is 
75 or less, Marion’s senior management believes the sample can be used to assess whether 
Strawgurt will appeal to potential customers who do not like Blugurt. 
Managerial Report 
Prepare a managerial report that addresses the following issues. 


1. Calculate the probability the mean score of Blugurt given by the simple random sample 
of Marion Dairies customers will be 75 or less. 

2. If the Marketing Department increases the sample size to 150, what is the probabil- 
ity the mean score of Blugurt given by the simple random sample of Marion Dairies 
customers will be 75 or less? 

3. Explain to Marion Dairies senior management why the probability that the mean score 
of Blugurt for a random sample of Marion Dairies customers will be 75 or less is differ- 
ent for samples of 50 and 150 Marion Dairies customers. 
Chapter 8 Interval Estimation 


STATISTICS IN PRACTICE 


Food Lion* 
SALISBURY, NORTH CAROLINA 


Founded in 1957 as Food Town, Food Lion is one of the 
largest supermarket chains in the United States, with 
over 1000 stores in 10 Southeastern and Mid-Atlantic 
states. The company sells more than 24,000 different 
products and offers nationally and regionally advertised 
brand-name merchandise, as well as a growing number 
of high-quality private label products manufactured 
especially for Food Lion. The company maintains its low 
price leadership and quality assurance through operat- 
ing efficiencies such as standard store formats, innova- 
tive warehouse design, energy-efficient facilities, and 
data synchronization with suppliers. Food Lion looks to 
a future of continued innovation, growth, price leader- 
ship, and service to its customers. 

Being in an inventory-intense business, Food Lion 
made the decision to adopt the LIFO (last-in, first-out) 
method of inventory valuation. This method matches 
current costs against current revenues, which mini- 
mizes the effect of radical price changes on profit and 
loss results. In addition, the LIFO method reduces net 
income, thereby reducing income taxes during periods 
of inflation. 

Food Lion establishes a LIFO index for each of 
seven inventory pools: Grocery, Paper/Household, 
Pet Supplies, Health & Beauty Aids, Dairy, Cigarettes/ 
Tobacco, and Beer/Wine. For example, a LIFO index 
of 1.008 for the Grocery pool would indicate that the 
company’s grocery inventory value at current costs 
reflects a 0.8% increase due to inflation over the most 
recent one-year period. 

A LIFO index for each inventory pool requires that the 
year-end inventory count for each product be valued at 


*The authors are indebted to Keith Cunningham, Tax Director, and 
Bobby Harkey, Staff Tax Accountant, at Food Lion for providing 
this Statistics in Practice. 




the current year-end cost and at the preceding year-end 
cost. To avoid excessive time and expense associated 
with counting the inventory in all store locations, Food 
Lion selects a random sample of 50 stores. Year-end 
physical inventories are taken in each of the sample 
stores. The current-year and preceding-year costs for 
each item are then used to construct the required LIFO 
indexes for each inventory pool. 

For a recent year, the sample estimate of the LIFO 
index for the Health & Beauty Aids inventory pool 
was 1.015. Using a 95% confidence level, Food Lion 
computed a margin of error of .006 for the sample 
estimate. Thus, the interval from 1.009 to 1.021 pro- 
vided a 95% confidence interval estimate of the popu- 
lation LIFO index. This level of precision was judged 
to be very good. 

In this chapter you will learn how to compute the 
margin of error associated with sample estimates. You 
will also learn how to use this information to construct 
and interpret interval estimates of a population mean 
and a population proportion. 


In Chapter 7, we stated that a point estimator is a sample statistic used to estimate a 
population parameter. For instance, the sample mean x̄ is a point estimator of the population 
mean μ and the sample proportion p̄ is a point estimator of the population proportion 
p. Because a point estimator cannot be expected to provide the exact value of the population 
parameter, an interval estimate is often computed by adding and subtracting a value, 
called the margin of error, to the point estimate. The general form of an interval estimate 
is as follows: 


Point estimate ± Margin of error 


The purpose of an interval estimate is to provide information about how close the point 
estimate, provided by the sample, is to the value of the population parameter. 
In this chapter we show how to compute interval estimates of a population mean μ and 
a population proportion p. The general form of an interval estimate of a population mean is 


x̄ ± Margin of error 

Similarly, the general form of an interval estimate of a population proportion is 

p̄ ± Margin of error 


The sampling distributions of x̄ and p̄ play key roles in computing these interval estimates. 


8.1 Population Mean: σ Known 


In order to develop an interval estimate of a population mean, either the population standard 
deviation σ or the sample standard deviation s must be used to compute the margin of 
error. In most applications σ is not known, and s is used to compute the margin of error. In 
some applications, however, large amounts of relevant historical data are available and can 
be used to estimate the population standard deviation prior to sampling. Also, in quality 
control applications where a process is assumed to be operating correctly, or “in control,” it 
is appropriate to treat the population standard deviation as known. We refer to such cases as 
σ known cases. In this section we introduce an example in which it is reasonable to treat σ 
as known and show how to construct an interval estimate for this case. 

Each week Lloyd’s Department Store selects a simple random sample of 100 customers 
in order to learn about the amount spent per shopping trip. With x representing the amount 
spent per shopping trip, the sample mean x̄ provides a point estimate of μ, the mean 
amount spent per shopping trip for the population of all Lloyd’s customers. Lloyd’s has 
been using the weekly survey for several years. Based on the historical data, Lloyd’s now 
assumes a known value of σ = $20 for the population standard deviation. The historical 
data also indicate that the population follows a normal distribution. 

During the most recent week, Lloyd’s surveyed 100 customers (n = 100) and obtained 
a sample mean of x̄ = $82. The sample mean amount spent provides a point estimate of 
the population mean amount spent per shopping trip, μ. In the discussion that follows, we 
show how to compute the margin of error for this estimate and develop an interval estimate 
of the population mean. 




Margin of Error and the Interval Estimate 


In Chapter 7 we showed that the sampling distribution of x̄ can be used to compute the 
probability that x̄ will be within a given distance of μ. In the Lloyd’s example, the historical 
data show that the population of amounts spent is normally distributed with a standard 
deviation of σ = 20. So, using what we learned in Chapter 7, we can conclude that the sampling 
distribution of x̄ follows a normal distribution with a standard error of σ_x̄ = σ/√n = 
20/√100 = 2. This sampling distribution is shown in Figure 8.1.¹ Because the sampling 
distribution shows how values of x̄ are distributed around the population mean μ, the sampling 
distribution of x̄ provides information about the possible differences between x̄ and μ. 

Using the standard normal probability table, we find that 95% of the values of any 
normally distributed random variable are within ±1.96 standard deviations of the mean. 
Thus, when the sampling distribution of x̄ is normally distributed, 95% of the x̄ values 
must be within ±1.96σ_x̄ of the mean μ. In the Lloyd’s example, we know that the sampling 
distribution of x̄ is normally distributed with a standard error of σ_x̄ = 2. Because 
1.96σ_x̄ = 1.96(2) = 3.92, we can conclude that 95% of all x̄ values obtained using a 
sample size of n = 100 will be within ±3.92 of the population mean μ. See Figure 8.2. 
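
The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a minimal sketch using the Lloyd’s figures from the text; the variable names are illustrative and are not part of the text’s Excel worksheet.

```python
import math

sigma = 20.0  # population standard deviation, treated as known in the Lloyd's example
n = 100       # sample size
z = 1.96      # z value bounding the middle 95% of a normal distribution

# Standard error of the sample mean: sigma divided by the square root of n
standard_error = sigma / math.sqrt(n)

# Distance from the population mean that captures 95% of all sample means
margin = z * standard_error

print(standard_error)    # 2.0
print(round(margin, 2))  # 3.92
```

Changing n in the sketch shows how the standard error, and with it the 3.92 figure, shrinks as the sample size grows.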


¹We use the fact that the population of amounts spent has a normal distribution to conclude that the sampling 
distribution of x̄ has a normal distribution. If the population did not have a normal distribution, we could rely on the 
central limit theorem and the sample size of n = 100 to conclude that the sampling distribution of x̄ is approximately 
normal. In either case, the sampling distribution of x̄ would appear as shown in Figure 8.1. 
FIGURE 8.1 Sampling Distribution of the Sample Mean Amount Spent from 
Simple Random Samples of 100 Customers 

[Figure: sampling distribution of x̄] 


FIGURE 8.2 Sampling Distribution of x̄ Showing the Location of Sample 
Means That Are Within 3.92 of μ 

[Figure: sampling distribution of x̄] 


In the introduction to this chapter, we said that the general form of an interval estimate 
of the population mean μ is x̄ ± margin of error. For the Lloyd’s example, suppose we set 
the margin of error equal to 3.92 and compute the interval estimate of μ using x̄ ± 3.92. 
To provide an interpretation for this interval estimate, let us consider the values of x̄ that 
could be obtained if we took three different simple random samples, each consisting of 
100 Lloyd’s customers. The first sample mean might turn out to have the value shown 
as x̄₁ in Figure 8.3. In this case, Figure 8.3 shows that the interval formed by subtracting 
3.92 from x̄₁ and adding 3.92 to x̄₁ includes the population mean μ. Now consider what 
happens if the second sample mean turns out to have the value shown as x̄₂ in Figure 8.3. 
Although this sample mean differs from the first sample mean, we see that the interval 
formed by subtracting 3.92 from x̄₂ and adding 3.92 to x̄₂ also includes the population 
mean μ. However, consider what happens if the third sample mean turns out to have the 
value shown as x̄₃ in Figure 8.3. In this case, the interval formed by subtracting 3.92 from 
x̄₃ and adding 3.92 to x̄₃ does not include the population mean μ. Because x̄₃ falls in the 
upper tail of the sampling distribution and is farther than 3.92 from μ, subtracting and 
adding 3.92 to x̄₃ forms an interval that does not include μ. 

FIGURE 8.3 Intervals Formed from Selected Sample Means at Locations x̄₁, x̄₂, and x̄₃ 

Any sample mean x̄ that is within the darkly shaded region of Figure 8.3 will provide an 
interval that contains the population mean μ. Because 95% of all possible sample means 
are in the darkly shaded region, 95% of all intervals formed by subtracting 3.92 from x̄ and 
adding 3.92 to x̄ will include the population mean μ. 
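
The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is not from the text: it assumes a hypothetical population mean of 80 (the true μ is unknown in the Lloyd’s example), draws many samples of 100 from a normal population with σ = 20, and counts how often x̄ ± 3.92 covers that mean.

```python
import random
import statistics

def coverage(mu, sigma, n, margin, trials, seed=0):
    """Fraction of simulated samples whose interval x_bar +/- margin contains mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

# mu = 80 is a made-up value used only to drive the simulation
rate = coverage(mu=80.0, sigma=20.0, n=100, margin=3.92, trials=4000)
print(rate)  # should land close to 0.95
```

The observed fraction varies slightly from run to run when the seed changes, but it stays near .95, which is what the shaded-region argument predicts.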

Recall that during the most recent week, the quality assurance team at Lloyd’s surveyed 
100 customers and obtained a sample mean amount spent of x̄ = 82. Using x̄ ± 3.92 to 
construct the interval estimate, we obtain 82 ± 3.92. Thus, the specific interval estimate 
of μ based on the data from the most recent week is 82 − 3.92 = 78.08 to 82 + 3.92 = 
85.92. Because 95% of all the intervals constructed using x̄ ± 3.92 will contain the population 
mean, we say that we are 95% confident that the interval 78.08 to 85.92 includes the 
population mean μ. We say that this interval has been established at the 95% confidence 
level. The value .95 is referred to as the confidence coefficient, and the interval 78.08 to 
85.92 is called the 95% confidence interval. 

This discussion provides insight as to why the interval is called a 95% confidence 
interval. 

Another term sometimes associated with an interval estimate is the level of significance. 
The level of significance associated with an interval estimate is denoted by the Greek letter 
α. The level of significance and the confidence coefficient are related as follows: 


α = Level of significance = 1 − Confidence coefficient 


The level of significance is the probability that the interval estimation procedure will 
generate an interval that does not contain μ. For example, the level of significance 
corresponding to a .95 confidence coefficient is α = 1 − .95 = .05. In the Lloyd’s case, 
the level of significance (α = .05) is the probability of drawing a sample, computing the 
sample mean, and finding that x̄ lies in one of the tails of the sampling distribution (see x̄₃ 
in Figure 8.3). When the sample mean happens to fall in the tail of the sampling distribution 
(and it will 5% of the time), the confidence interval generated will not contain μ. 

With the margin of error given by z_{α/2}(σ/√n), the general form of an interval estimate 
of a population mean for the σ known case follows. 


INTERVAL ESTIMATE OF A POPULATION MEAN: σ KNOWN 

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \qquad (8.1)$$

where (1 − α) is the confidence coefficient and z_{α/2} is the z value providing an area of 
α/2 in the upper tail of the standard normal probability distribution. 


Let us use expression (8.1) to construct a 95% confidence interval for the Lloyd’s 
example. For a 95% confidence interval, the confidence coefficient is (1 − α) = .95 and 
thus, α = .05. Using the standard normal probability table, an area of α/2 = .05/2 = .025 
in the upper tail provides z_{.025} = 1.96. With the Lloyd’s sample mean x̄ = 82, σ = 20, and 
a sample size n = 100, we obtain 

$$82 \pm 1.96 \frac{20}{\sqrt{100}}$$

$$82 \pm 3.92$$

Thus, using expression (8.1), the margin of error is 3.92 and the 95% confidence interval is 
82 − 3.92 = 78.08 to 82 + 3.92 = 85.92. 

Although a 95% confidence level is frequently used, other confidence levels such as 
90% and 99% may be considered. Values of z_{α/2} for the most commonly used confidence 
levels are shown in Table 8.1. Using these values and expression (8.1), the 90% confidence 
interval for the Lloyd’s example is 

$$82 \pm 1.645 \frac{20}{\sqrt{100}}$$

$$82 \pm 3.29$$

Thus, at 90% confidence, the margin of error is 3.29 and the confidence interval is 
82 − 3.29 = 78.71 to 82 + 3.29 = 85.29. Similarly, the 99% confidence interval is 

$$82 \pm 2.576 \frac{20}{\sqrt{100}}$$

$$82 \pm 5.15$$

Thus, at 99% confidence, the margin of error is 5.15 and the confidence interval is 
82 − 5.15 = 76.85 to 82 + 5.15 = 87.15. 

Comparing the results for the 90%, 95%, and 99% confidence levels, we see that in 
order to have a higher level of confidence, the margin of error and thus the width of the 
confidence interval must be larger. 


TABLE 8.1 Values of z_{α/2} for the Most Commonly Used Confidence Levels 


Confidence Level    α      α/2     z_{α/2} 
90%                 .10    .05     1.645 
95%                 .05    .025    1.960 
99%                 .01    .005    2.576 
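
The z values in Table 8.1 need not be read from a printed table. As a rough sketch, Python’s standard library class statistics.NormalDist can produce them; the helper name z_upper_tail is illustrative, and the loop simply reapplies expression (8.1) to the Lloyd’s figures (x̄ = 82, σ = 20, n = 100).

```python
from statistics import NormalDist

def z_upper_tail(confidence):
    """z value that leaves an area of alpha/2 in the upper tail of the standard normal."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

x_bar, sigma, n = 82.0, 20.0, 100  # Lloyd's example values from the text

for confidence in (0.90, 0.95, 0.99):
    z = z_upper_tail(confidence)
    margin = z * sigma / n ** 0.5  # margin of error from expression (8.1)
    print(f"{confidence:.0%}  z = {z:.3f}  "
          f"interval = {x_bar - margin:.2f} to {x_bar + margin:.2f}")
```

The three printed intervals widen as the confidence level rises, matching the comparison made in the text.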
We will use the Lloyd’s Department Store data to illustrate how Excel can be used to construct 
an interval estimate of the population mean for the σ known case. Refer to Figure 8.4 
as we describe the tasks involved. The formula worksheet is in the background; the value 
worksheet appears in the foreground. 


Enter/Access Data: Open the file Lloyd’s. A label and the sales data are entered in cells 
A1:A101. 


Enter Functions and Formulas: The sample size and sample mean are computed in 
cells D4:D5 using Excel’s COUNT and AVERAGE functions, respectively. The value 
worksheet shows that the sample size is 100 and the sample mean is 82. The value of the 
known population standard deviation (20) is entered in cell D7 and the desired confidence 
coefficient (.95) is entered in cell D8. The level of significance is computed in cell D9 
by entering the formula =1-D8; the value worksheet shows that the level of significance 
associated with a confidence coefficient of .95 is .05. The margin of error is computed 
in cell D11 using Excel’s CONFIDENCE.NORM function. The CONFIDENCE.NORM 
function has three inputs: the level of significance (cell D9); the population standard deviation 
(cell D7); and the sample size (cell D4). Thus, to compute the margin of error associated 
with a 95% confidence interval, the following formula is entered in cell D11: 


=CONFIDENCE.NORM(D9,D7,D4) 


The resulting value of 3.92 is the margin of error associated with the interval estimate of 
the population mean amount spent per shopping trip. 

Cells D13:D15 provide the point estimate and the lower and upper limits for the confidence 
interval. Because the point estimate is just the sample mean, the formula =D5 
is entered in cell D13. To compute the lower limit of the 95% confidence interval, x̄ − 
(margin of error), we enter the formula =D13-D11 in cell D14. To compute the upper limit 
of the 95% confidence interval, x̄ + (margin of error), we enter the formula =D13+D11 in 
cell D15. The value worksheet shows a lower limit of 78.08 and an upper limit of 85.92. In 
other words, the 95% confidence interval for the population mean is from 78.08 to 85.92. 
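
For readers who want to check the worksheet outside Excel, the same steps can be sketched in Python. The function confidence_norm below is a hypothetical stand-in that takes the same three inputs the text lists for CONFIDENCE.NORM; it is not Excel’s implementation. The data list is placeholder data chosen only so that n = 100 and x̄ = 82; it is not the contents of the Lloyd’s file.

```python
from statistics import NormalDist, fmean

def confidence_norm(alpha, sigma, n):
    """Margin of error for a mean when sigma is known (same inputs as CONFIDENCE.NORM)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma / n ** 0.5

def interval_estimate(data, sigma, confidence=0.95):
    """Return (point estimate, margin of error, lower limit, upper limit)."""
    n = len(data)                                         # cell D4: sample size
    x_bar = fmean(data)                                   # cell D5: sample mean
    margin = confidence_norm(1.0 - confidence, sigma, n)  # cell D11: margin of error
    return x_bar, margin, x_bar - margin, x_bar + margin  # cells D13:D15

# Placeholder data: 100 values averaging 82 (NOT the actual Lloyd's data)
placeholder_data = [80.0, 84.0] * 50
point, margin, lower, upper = interval_estimate(placeholder_data, sigma=20.0)
print(round(point, 2), round(margin, 2), round(lower, 2), round(upper, 2))
```

Passing a different confidence value plays the same role as changing the confidence coefficient in cell D8.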


FIGURE 8.4 Excel Worksheet: Constructing a 95% Confidence Interval for Lloyd’s 
Department Store 


[Figure: formula worksheet (background) and value worksheet (foreground), titled 
“Interval Estimate of a Population Mean: σ Known Case.” Column A holds the label 
Amount and the data. Cell contents recoverable from the figure and the text:] 

Cell   Label                            Formula                       Value 
D4     Sample Size                      =COUNT(A1:A101)               100 
D5     Sample Mean                      =AVERAGE(A1:A101)             82 
D7     Population Standard Deviation    20                            20 
D8     Confidence Coefficient           0.95                          0.95 
D9     Level of Significance            =1-D8                         0.05 
D11    Margin of Error                  =CONFIDENCE.NORM(D9,D7,D4)    3.92 
D13    Point Estimate                   =D5                           82 
D14    Lower Limit                      =D13-D11                      78.08 
D15    Upper Limit                      =D13+D11                      85.92 

Note: Rows 18-99 are hidden. 
A Template for Other Problems To use this worksheet as a template for another problem 
of this type, we must first enter the new problem data in column A. Then, the cell formulas 
in cells D4 and D5 must be updated with the new data range and the known population 
standard deviation must be entered in cell D7. After doing so, the point estimate and a 
95% confidence interval will be displayed in cells D13:D15. If a confidence interval with a 
different confidence coefficient is desired, we simply change the value in cell D8. 

We can further simplify the use of Figure 8.4 as a template for other problems by elimi- 
nating the need to enter new data ranges in cells D4 and D5. To do so we rewrite the cell 
formulas as follows: 


Cell D4: =COUNT(A:A) 
Cell D5: =AVERAGE(A:A) 


With the A:A method of specifying data ranges, Excel’s COUNT function will count the 
number of numerical values in column A and Excel’s AVERAGE function will compute 
the average of the numerical values in column A. Thus, to solve a new problem it is only 
necessary to enter the new data into column A and enter the value of the known population 
standard deviation in cell D7. 

This worksheet can also be used as a template for text exercises in which the sample 
size, sample mean, and the population standard deviation are given. In this type of situation 
we simply replace the values in cells D4, D5, and D7 with the given values of the sample 
size, sample mean, and the population standard deviation. 

The Lloyd’s data set includes a worksheet entitled Template that uses the A:A method 
for entering the data ranges. 


Practical Advice 


If the population follows a normal distribution, the confidence interval provided by 
expression (8.1) is exact. In other words, if expression (8.1) were used repeatedly to 
generate 95% confidence intervals, exactly 95% of the intervals generated would contain 
the population mean. If the population does not follow a normal distribution, the confi- 
dence interval provided by expression (8.1) will be approximate. In this case, the quality of 
the approximation depends on both the distribution of the population and the sample size. 

In most applications, a sample size of n ≥ 30 is adequate when using expression (8.1) 
to develop an interval estimate of a population mean. If the population is not normally dis- 
tributed, but is roughly symmetric, sample sizes as small as 15 can be expected to provide 
good approximate confidence intervals. With smaller sample sizes, expression (8.1) should 
only be used if the analyst believes, or is willing to assume, that the population distribution 
is at least approximately normal. 


NOTES + COMMENTS 


1. The interval estimation procedure discussed in this section is based on the assumption 
that the population standard deviation σ is known. By σ known we mean that historical 
data or other information are available that permit us to obtain a good estimate of the 
population standard deviation prior to taking the sample that will be used to develop 
an estimate of the population mean. So technically we don’t mean that σ is actually 
known with certainty. We just mean that we obtained a good estimate of the standard 
deviation prior to sampling and thus we won’t be using the same sample to estimate 
both the population mean and the population standard deviation. 

2. The sample size n appears in the denominator of the interval estimation expression (8.1). 
Thus, if a particular sample size provides too wide an interval to be of any practical 
use, we may want to consider increasing the sample size. With n in the denominator, 
a larger sample size will provide a smaller margin of error, a narrower interval, and 
greater precision. The procedure for determining the size of a simple random sample 
necessary to obtain a desired precision is discussed in Section 8.3. 

3. When developing a confidence interval for the mean with a sample size that is at least 
5% of the population size (that is, n/N ≥ .05), the finite population correction factor 
should be used when calculating the standard error of the sampling distribution of x̄ 
when σ is known, i.e., 

$$\sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right)$$
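
The finite population correction described in the last note can be sketched as follows. The function name and the population size of 500 are hypothetical, chosen only to show a case where n/N ≥ .05; σ = 20 and n = 100 are borrowed from the Lloyd’s example.

```python
import math

def standard_error_fpc(sigma, n, population_size):
    """Standard error of the sample mean, applying the finite population
    correction factor when the sample is at least 5% of the population."""
    se = sigma / math.sqrt(n)
    if n / population_size >= 0.05:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Hypothetical population of 500: n/N = .20, so the correction applies
print(standard_error_fpc(20.0, 100, 500))

# Very large population: n/N is far below .05, so the usual sigma/sqrt(n) is returned
print(standard_error_fpc(20.0, 100, 1_000_000))
```

With the correction applied, the standard error falls below the uncorrected value of 2, so the resulting interval is narrower.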
EXERCISES 


Methods 

1. A simple random sample of 40 items resulted in a sample mean of 25. The population 
standard deviation is σ = 5. 
a. What is the standard error of the mean, σ_x̄? 
b. At 95% confidence, what is the margin of error? 

2. A simple random sample of 50 items from a population with σ = 6 resulted in a 
sample mean of 32. 
a. Provide a 90% confidence interval for the population mean. 
b. Provide a 95% confidence interval for the population mean. 
c. Provide a 99% confidence interval for the population mean. 

3. A simple random sample of 60 items resulted in a sample mean of 80. The population 
standard deviation is σ = 15. 
a. Compute the 95% confidence interval for the population mean. 
b. Assume that the same sample mean was obtained from a sample of 120 items. 
Provide a 95% confidence interval for the population mean. 
c. What is the effect of a larger sample size on the interval estimate? 

4. A 95% confidence interval for a population mean was reported to be 152 to 160. If 
σ = 15, what sample size was used in this study? 


Applications 

5. Restaurant Bills. Data were collected on the amount spent by 64 customers for lunch 
at a major Houston restaurant. These data are contained in the file Houston. Based 
upon past studies the population standard deviation is known with σ = $6. 

a. At 99% confidence, what is the margin of error? 
b. Develop a 99% confidence interval estimate of the mean amount spent for lunch. 

6. Travel Taxes. In an attempt to assess total daily travel taxes in various cities, the Global 
Business Travel Association conducted a study of daily travel taxes on lodging, rental 
car, and meals. The data contained in the file TravelTax are consistent with the findings 
of that study for business travel to Chicago. Assume the population standard deviation 
is known to be $8.50 and develop a 95% confidence interval of the population mean 
total daily travel taxes for Chicago. 

7. Cost of Dog Ownership. Money reports that the average annual cost of the first year 
of owning and caring for a large dog in 2017 is $1448. The Irish Red and White Setter 
Association of America has requested a study to estimate the annual first-year cost for 
owners of this breed. A sample of 50 will be used. Based on past studies, the population 
standard deviation is assumed known with σ = $255. 

a. What is the margin of error for a 95% confidence interval of the mean cost of the 
first year of owning and caring for this breed? 

b. The file Setters contains data collected from fifty owners of Irish Setters on the 
cost of the first year of owning and caring for their dogs. Use this data set to 
compute the sample mean. Using this sample, what is the 95% confidence inter- 
val for the mean cost of the first year of owning and caring for an Irish Red and 
White Setter? 

8. Cost of Massage Therapy Sessions. The Wall Street Journal reported on several 
studies that show massage therapy has a variety of health benefits and it is not too 
expensive. A sample of 10 typical one-hour massage therapy sessions showed an 
average charge of $59. The population standard deviation for a one-hour session is 
σ = $5.50. 

a. What assumptions about the population should we be willing to make if a margin of 
error is desired? 

b. Using 95% confidence, what is the margin of error? 

c. Using 99% confidence, what is the margin of error? 


9. Cost to Repair Fire Damage. The mean cost to repair the smoke and fire damage that
results from home fires of all causes is $11,389 (HomeAdvisor). How does the damage
that results from home fires caused by careless use of tobacco compare? The file
TobaccoFires provides the cost to repair smoke and fire damage associated with a sample
of 55 fires caused by careless use of tobacco products. Using past years’ data, the
population standard deviation can be assumed known with σ = $3027. What is the 95%
confidence interval estimate of the mean cost to repair smoke and fire damage that results
from home fires caused by careless use of tobacco? How does this compare with the mean
cost to repair the smoke and fire damage that results from home fires of all causes?

10. Assisted-Living Facility Rent. Costs are rising for all kinds of medical care. The mean 
monthly rent at assisted-living facilities was reported to have increased 17% over the 
last five years to $3486. Assume this cost estimate is based on a sample of 120 facilities 
and, from past studies, it can be assumed that the population standard deviation is 
σ = $650.

a. Develop a 90% confidence interval estimate of the population mean monthly rent.
b. Develop a 95% confidence interval estimate of the population mean monthly rent.
c. Develop a 99% confidence interval estimate of the population mean monthly rent.
d. What happens to the width of the confidence interval as the confidence level is
increased? Does this seem reasonable? Explain.




8.2 Population Mean: σ Unknown


When developing an interval estimate of a population mean, we usually do not have a good
estimate of the population standard deviation either. In these cases, we must use the same
sample to estimate both μ and σ. This situation represents the σ unknown case. When s
is used to estimate σ, the margin of error and the interval estimate for the population mean
are based on a probability distribution known as the t distribution. Although the
mathematical development of the t distribution is based on the assumption of a normal
distribution for the population we are sampling from, research shows that the t distribution
can be successfully applied in many situations where the population deviates significantly
from normal. Later in this section we provide guidelines for using the t distribution if the
population is not normally distributed.

The t distribution is a family of similar probability distributions, with a specific t
distribution depending on a parameter known as the degrees of freedom. The t distribution
with 1 degree of freedom is unique, as is the t distribution with 2 degrees of freedom, with
3 degrees of freedom, and so on. As the number of degrees of freedom increases, the
difference between the t distribution and the standard normal distribution becomes smaller
and smaller. Figure 8.5 shows t distributions with 10 and 20 degrees of freedom and their
relationship to the standard normal probability distribution. Note that a t distribution with
more degrees of freedom exhibits less variability and more closely resembles the standard
normal distribution. Note also that the mean of the t distribution is zero.

William Sealy Gosset, writing under the name “Student,” is the founder of the t
distribution. Gosset, an Oxford graduate in mathematics, worked for the Guinness Brewery
in Dublin, Ireland. He developed the t distribution while working on small-scale materials
and temperature experiments.

We place a subscript on t to indicate the area in the upper tail of the t distribution. For
example, just as we used z_{.025} to indicate the z value providing a .025 area in the upper
tail of a standard normal distribution, we will use t_{.025} to indicate a .025 area in the
upper tail of a t distribution. In general, we will use the notation t_{α/2} to represent a t
value with an area of α/2 in the upper tail of the t distribution. See Figure 8.6.

FIGURE 8.5 Comparison of the Standard Normal Distribution with t Distributions Having
10 and 20 Degrees of Freedom

[Figure: three curves labeled standard normal distribution, t distribution (20 degrees of
freedom), and t distribution (10 degrees of freedom); horizontal axis z, t.]

FIGURE 8.6 t Distribution with α/2 Area or Probability in the Upper Tail

As the degrees of freedom increase, the t distribution approaches the standard normal
distribution.

Table 2 in Appendix B (available online) contains a table for the t distribution. A
portion of this table is shown in Table 8.2. Each row in the table corresponds to a separate
t distribution with the degrees of freedom shown. For example, for a t distribution with
9 degrees of freedom, t_{.025} = 2.262. Similarly, for a t distribution with 60 degrees of
freedom, t_{.025} = 2.000. As the degrees of freedom continue to increase, t_{.025}
approaches z_{.025} = 1.96. In fact, the standard normal distribution z values can be found in
the infinite degrees of freedom row (labeled ∞) of the t distribution table. If the degrees of
freedom exceed 100, the infinite degrees of freedom row can be used to approximate the
actual t value; in other words, for more than 100 degrees of freedom, the standard normal
z value provides a good approximation to the t value.
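The convergence of t values toward z values can be checked numerically. The following Python sketch is not part of the text's Excel workflow; it is a minimal illustration that integrates the t density with Simpson's rule and solves for the upper-tail t value by bisection. The function names (`t_pdf`, `upper_tail_area`, `t_value`), the number of integration steps, and the search bracket [0, 200] are all assumptions made for this example.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_area(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric about 0, so the area to the right of 0 is .5.
    return 0.5 - total * h / 3


def t_value(upper_area, df, hi=200.0):
    """t value with the given upper-tail area, found by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > upper_area:
            lo = mid   # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for df in (9, 60, 100, 1000):
        print(df, round(t_value(0.025, df), 3))
```

Under these assumptions the values for 9, 60, and 100 degrees of freedom should agree with the .025 column of Table 8.2 to three decimal places, and the value for 1000 degrees of freedom should be close to 1.96.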


Margin of Error and the Interval Estimate 
In Section 8.1 we showed that an interval estimate of a population mean for the σ known
case is

$$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$

To compute an interval estimate of μ for the σ unknown case, the sample standard
deviation s is used to estimate σ, and z_{α/2} is replaced by the t distribution value t_{α/2}.
The margin of error is then given by t_{α/2} s/√n. With this margin of error, the general
expression for an interval estimate of a population mean when σ is unknown follows.

INTERVAL ESTIMATE OF A POPULATION MEAN: σ UNKNOWN

$$ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \tag{8.2} $$

where s is the sample standard deviation, (1 − α) is the confidence coefficient, and t_{α/2}
is the t value providing an area of α/2 in the upper tail of the t distribution with n − 1
degrees of freedom.

TABLE 8.2 Selected Values from the t Distribution Table*

                          Area in Upper Tail
Degrees
of Freedom    .20      .10      .05      .025      .01      .005
     1       1.376    3.078    6.314   12.706   31.821   63.656
     2       1.061    1.886    2.920    4.303    6.965    9.925
     3        .978    1.638    2.353    3.182    4.541    5.841
     4        .941    1.533    2.132    2.776    3.747    4.604
     5        .920    1.476    2.015    2.571    3.365    4.032
     6        .906    1.440    1.943    2.447    3.143    3.707
     7        .896    1.415    1.895    2.365    2.998    3.499
     8        .889    1.397    1.860    2.306    2.896    3.355
     9        .883    1.383    1.833    2.262    2.821    3.250

    60        .848    1.296    1.671    2.000    2.390    2.660
    61        .848    1.296    1.670    2.000    2.389    2.659
    62        .847    1.295    1.670    1.999    2.388    2.657
    63        .847    1.295    1.669    1.998    2.387    2.656
    64        .847    1.295    1.669    1.998    2.386    2.655
    65        .847    1.295    1.669    1.997    2.385    2.654
    66        .847    1.295    1.668    1.997    2.384    2.652
    67        .847    1.294    1.668    1.996    2.383    2.651
    68        .847    1.294    1.668    1.995    2.382    2.650
    69        .847    1.294    1.667    1.995    2.382    2.649

    90        .846    1.291    1.662    1.987    2.368    2.632
    91        .846    1.291    1.662    1.986    2.368    2.631
    92        .846    1.291    1.662    1.986    2.368    2.630
    93        .846    1.291    1.661    1.986    2.367    2.630
    94        .845    1.291    1.661    1.986    2.367    2.629
    95        .845    1.291    1.661    1.985    2.366    2.629
    96        .845    1.290    1.661    1.985    2.366    2.628
    97        .845    1.290    1.661    1.985    2.365    2.627
    98        .845    1.290    1.661    1.984    2.365    2.627
    99        .845    1.290    1.660    1.984    2.364    2.626
   100        .845    1.290    1.660    1.984    2.364    2.626
     ∞        .842    1.282    1.645    1.960    2.326    2.576

*Note: A more extensive table is provided as Table 2 of Appendix B (available online).


The reason the number of degrees of freedom associated with the t value in expression
(8.2) is n − 1 concerns the use of s as an estimate of the population standard deviation σ.
The expression for the sample standard deviation is

$$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} $$

Degrees of freedom refer to the number of independent pieces of information that go into
the computation of Σ(x_i − x̄)². The n pieces of information involved in computing
Σ(x_i − x̄)² are as follows: x_1 − x̄, x_2 − x̄, ..., x_n − x̄. In Section 3.2 we indicated that
Σ(x_i − x̄) = 0 for any data set. Thus, only n − 1 of the x_i − x̄ values are independent; that
is, if we know n − 1 of the values, the remaining value can be determined exactly by
using the condition that the sum of the x_i − x̄ values must be 0. Thus, n − 1 is the number
of degrees of freedom associated with Σ(x_i − x̄)² and hence the number of degrees of
freedom for the t distribution in expression (8.2).
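The degrees-of-freedom argument can be illustrated with a few lines of Python. This is only a sketch: the five data values are hypothetical and were chosen for easy arithmetic. It shows that the deviations sum to zero, so the last deviation can be recovered from the other n − 1, and that dividing the sum of squared deviations by n − 1 reproduces the sample standard deviation.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration (not from the text).
data = [12.0, 15.0, 9.0, 20.0, 19.0]
n = len(data)
xbar = sum(data) / n

# The n deviations x_i - xbar always sum to zero.
deviations = [x - xbar for x in data]
print("sum of deviations:", sum(deviations))

# Knowing any n - 1 deviations fixes the remaining one.
recovered_last = -sum(deviations[:-1])
print("last deviation:", deviations[-1], "recovered:", recovered_last)

# Sample standard deviation with n - 1 in the denominator.
s = math.sqrt(sum(d * d for d in deviations) / (n - 1))
print("s:", round(s, 3), "statistics.stdev:", round(statistics.stdev(data), 3))
```

The standard library function `statistics.stdev` uses the same n − 1 divisor, so the two printed standard deviations should match.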

To illustrate the interval estimation procedure for the σ unknown case, we will consider
a study designed to estimate the mean credit card debt for the population of U.S.
households. A sample of n = 70 households provided the credit card balances shown in
Table 8.3. For this situation, no previous estimate of the population standard deviation σ
is available. Thus, the sample data must be used to estimate both the population mean and
the population standard deviation. Using the data in Table 8.3, we compute the sample
mean x̄ = $9312 and the sample standard deviation s = $4007. With 95% confidence and
n − 1 = 69 degrees of freedom, Table 8.2 can be used to obtain the appropriate value for
t_{.025}. We want the t value in the row with 69 degrees of freedom, and the column
corresponding to .025 in the upper tail. The value shown is t_{.025} = 1.995.


TABLE 8.3 Credit Card Balances for a Sample of 70 Households

 9,430  14,661   7,159   9,071   9,691  11,032
 7,535  12,195   8,137   3,603  11,448   6,525
 4,078  10,544   9,467  16,804   8,279   5,239
 5,604  13,659  12,595  13,479   5,649   6,195
 5,179   7,061   7,917  14,044  11,298  12,584
 4,416   6,245  11,346   6,817   4,353  15,415
10,676  13,021  12,806   6,845   3,467  15,917
 1,627   9,719   4,972  10,493   6,191  12,591
10,112   2,200  11,356     615  12,851   9,743
 6,567  10,746   7,117  13,627   5,337  10,324
13,627  12,744   9,465  12,557   8,372
18,719   5,742  19,263   6,232   7,445

We use expression (8.2) to compute an interval estimate of the population mean credit
card balance.

$$ 9312 \pm 1.995 \frac{4007}{\sqrt{70}} $$

$$ 9312 \pm 955 $$

The point estimate of the population mean is $9312, the margin of error is $955, and the 
95% confidence interval is 9312 — 955 = $8357 to 9312 + 955 = $10,267. Thus, we are 
95% confident that the mean credit card balance for the population of all households is 
between $8357 and $10,267. 
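The arithmetic in expression (8.2) can be reproduced outside of Excel. The following Python sketch is an illustration only; the function name `t_interval` is an assumption, and the t value is passed in by hand from Table 8.2 rather than computed.

```python
import math


def t_interval(xbar, s, n, t_value):
    """Interval estimate of a population mean, sigma unknown (expression 8.2).

    t_value is t_{alpha/2} with n - 1 degrees of freedom, read from a t table.
    Returns (lower limit, upper limit, margin of error).
    """
    margin = t_value * s / math.sqrt(n)
    return xbar - margin, xbar + margin, margin


# Credit card balances: xbar = 9312, s = 4007, n = 70, t_.025 = 1.995 (69 df).
lower, upper, margin = t_interval(9312, 4007, 70, 1.995)
print(round(margin), round(lower), round(upper))  # 955 8357 10267
```

Keeping the t value as an explicit argument mirrors the table lookup described in the text; a statistics library could supply the value instead.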


We will use the credit card balances in Table 8.3 to illustrate how Excel can be used to 
construct an interval estimate of the population mean for the σ unknown case. We start
by summarizing the data using Excel’s Descriptive Statistics tool described in Chapter 3. 
Refer to Figure 8.7 as we describe the tasks involved. The formula worksheet is in the 
background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file NewBalance. A label and the credit card balances are 
entered in cells A1:A71. 


FIGURE 8.7 Excel Worksheet: 95% Confidence Interval for Credit Card Balances

[Figure: formula worksheet (background) and value worksheet (foreground). Cell D18, Point
Estimate: =D3, value 9312. Cell D19, Lower Limit: =D18-D16, value 8357. Cell D20, Upper
Limit: =D18+D16, value 10267. Note: Rows 21-69 are hidden.]

Larger sample sizes are needed if the distribution of the population is highly skewed or
includes outliers.

Apply Analysis Tools: The following steps describe how to use Excel’s Descriptive
Statistics tool for these data:


Step 1. Click the Data tab on the Ribbon 
Step 2. In the Analysis group, click Data Analysis 
Step 3. Choose Descriptive Statistics from the list of Analysis Tools 
Step 4. When the Descriptive Statistics dialog box appears: 
Enter A1:A71 in the Input Range: box
Select Columns in the Grouped By: area
Select the check box for Labels in First Row
Select Output Range: in the Output Options area
Enter C1 in the Output Range: box
Select the check box for Summary Statistics 
Select the check box for Confidence Level for Mean: 
Enter 95 in the Confidence Level for Mean: box 
Click OK 


The sample mean (x̄) is in cell D3. The margin of error, labeled Confidence Level(95.0%),
appears in cell D16. The value worksheet shows x̄ = 9312 and a margin of error equal
to 955.


Enter Functions and Formulas: Cells D18:D20 provide the point estimate and the lower
and upper limits for the confidence interval. Because the point estimate is just the sample
mean, the formula =D3 is entered in cell D18. To compute the lower limit of the 95%
confidence interval, x̄ − (margin of error), we enter the formula =D18-D16 in cell D19. To
compute the upper limit of the 95% confidence interval, x̄ + (margin of error), we enter the
formula =D18+D16 in cell D20. The value worksheet shows a lower limit of 8357 and an
upper limit of 10,267. In other words, the 95% confidence interval for the population mean
is from 8357 to 10,267.


Practical Advice 


If the population follows a normal distribution, the confidence interval provided by expres- 
sion (8.2) is exact and can be used for any sample size. If the population does not follow a 
normal distribution, the confidence interval provided by expression (8.2) will be approx- 
imate. In this case, the quality of the approximation depends on both the distribution of the 
population and the sample size. 

In most applications, a sample size of n ≥ 30 is adequate when using expression (8.2)
to develop an interval estimate of a population mean. However, if the population distribu- 
tion is highly skewed or contains outliers, most statisticians would recommend increasing 
the sample size to 50 or more. If the population is not normally distributed but is roughly 
symmetric, sample sizes as small as 15 can be expected to provide good approximate 
confidence intervals. With smaller sample sizes, expression (8.2) should only be used if the 
analyst believes, or is willing to assume, that the population distribution is at least approxi- 
mately normal. 


Using a Small Sample 


In the following example we develop an interval estimate for a population mean when 

the sample size is small. As we already noted, an understanding of the distribution of the 
population becomes a factor in deciding whether the interval estimation procedure provides 
acceptable results. 

Scheer Industries is considering a new computer-assisted program to train maintenance 
employees to do machine repairs. In order to fully evaluate the program, the director of 
manufacturing requested an estimate of the population mean time required for maintenance 
employees to complete the computer-assisted training.

DATA file: Scheer

TABLE 8.4 Training Time in Days for a Sample of 20 Scheer Industries Employees


44 50 42 48 
55 54 60 55 
44 62 62 57 
45 46 43 56 


A sample of 20 employees is selected, with each employee in the sample completing 
the training program. Data on the training time in days for the 20 employees are shown in 
Table 8.4. A histogram of the sample data appears in Figure 8.8. What can we say about 
the distribution of the population based on this histogram? First, the sample data do not 
support the conclusion that the distribution of the population is normal, yet we do not see 
any evidence of skewness or outliers. Therefore, using the guidelines in the previous
subsection, we conclude that an interval estimate based on the t distribution appears
acceptable for the sample of 20 employees.

We continue by computing the sample mean and sample standard deviation as follows. 
$$ \bar{x} = \frac{\sum x_i}{n} = \frac{1030}{20} = 51.5 \text{ days} $$

$$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{889}{20 - 1}} = 6.84 \text{ days} $$

FIGURE 8.8 Histogram of Training Times for the Scheer Industries Sample

[Histogram: vertical axis Frequency; horizontal axis Training Time (days).]

For a 95% confidence interval, we use Table 2 of Appendix B (available online) and
n − 1 = 19 degrees of freedom to obtain t_{.025} = 2.093. Expression (8.2) provides the
interval estimate of the population mean.

$$ 51.5 \pm 2.093 \frac{6.84}{\sqrt{20}} $$

$$ 51.5 \pm 3.2 $$


The point estimate of the population mean is 51.5 days. The margin of error is 3.2 days and 
the 95% confidence interval is 51.5 — 3.2 = 48.3 days to 51.5 + 3.2 = 54.7 days. 

Using a histogram of the sample data to learn about the distribution of a population is 
not always conclusive, but in many cases it provides the only information available. The 
histogram, along with judgment on the part of the analyst, can often be used to decide 
whether expression (8.2) can be used to develop the interval estimate. 


Summary of Interval Estimation Procedures 


We provided two approaches to developing an interval estimate of a population mean. For
the σ known case, σ and the standard normal distribution are used in expression (8.1) to
compute the margin of error and to develop the interval estimate. For the σ unknown case,
the sample standard deviation s and the t distribution are used in expression (8.2) to
compute the margin of error and to develop the interval estimate.

A summary of the interval estimation procedures for the two cases is shown in
Figure 8.9. In most applications, a sample size of n ≥ 30 is adequate. If the population
has a normal or approximately normal distribution, however, smaller sample sizes may be
used. For the σ unknown case, a sample size of n ≥ 50 is recommended if the population
distribution is believed to be highly skewed or has outliers.
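The choice between the two procedures can also be written as a small function. This Python sketch is an illustration under stated assumptions: the name `interval_for_mean` and its argument layout are hypothetical, the z value comes from the standard library's `statistics.NormalDist`, and the t value must be supplied by the caller (for example, from the t distribution table).

```python
import math
from statistics import NormalDist


def interval_for_mean(xbar, n, confidence=0.95, sigma=None, s=None, t_value=None):
    """Interval estimate of a population mean.

    sigma known   -> z-based interval, expression (8.1)
    sigma unknown -> t-based interval, expression (8.2); the caller supplies s
                     and t_value = t_{alpha/2} with n - 1 degrees of freedom.
    """
    if sigma is not None:
        alpha = 1 - confidence
        z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
        margin = z * sigma / math.sqrt(n)
    else:
        if s is None or t_value is None:
            raise ValueError("sigma unknown: provide s and a table t value")
        margin = t_value * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# sigma unknown: Scheer Industries example (n = 20, t_.025 = 2.093 with 19 df).
print([round(v, 1) for v in interval_for_mean(51.5, 20, s=6.84, t_value=2.093)])

# sigma known: hypothetical values chosen only for illustration.
print([round(v, 2) for v in interval_for_mean(100.0, 64, sigma=16.0)])
```

The first call should reproduce the 48.3 to 54.7 day interval from the small-sample example; the second uses made-up numbers to exercise the σ known branch.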


FIGURE 8.9 Summary of Interval Estimation Procedures for a Population Mean

[Flowchart: Can the population standard deviation σ be assumed known? Yes: σ Known Case,
use expression (8.1), x̄ ± z_{α/2} σ/√n. No: use the sample standard deviation s to
estimate σ; σ Unknown Case, use expression (8.2), x̄ ± t_{α/2} s/√n.]

NOTES + COMMENTS

1. When σ is known, the margin of error, z_{α/2}(σ/√n), is fixed and is the same for all
samples of size n. When σ is unknown, the margin of error, t_{α/2}(s/√n), varies from
sample to sample. This variation occurs because the sample standard deviation s varies
depending upon the sample selected. A large value for s provides a larger margin of error,
while a small value for s provides a smaller margin of error.

2. What happens to confidence interval estimates when the population is skewed? Consider
a population that is skewed to the right with large data values stretching the distribution
to the right. When such skewness exists, the sample mean x̄ and the sample standard
deviation s are positively correlated. Larger values of s tend to be associated with larger
values of x̄. Thus, when x̄ is larger than the population mean, s tends to be larger than σ.
This skewness causes the margin of error, t_{α/2}(s/√n), to be larger than it would be with
σ known. The confidence interval with the larger margin of error tends to include the
population mean μ more often than it would if the true value of σ were used. But when x̄
is smaller than the population mean, the correlation between x̄ and s causes the margin
of error to be small. In this case, the confidence interval with the smaller margin of error
tends to miss the population mean more than it would if we knew σ and used it. For this
reason, we recommend using larger sample sizes with highly skewed population
distributions.

3. When developing a confidence interval for the mean with a sample size that is at least
5% of the population size (that is, n/N ≥ .05), the finite population correction factor
should be used when calculating the standard error of the sampling distribution of x̄ when
σ is unknown, i.e.,

$$ s_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{s}{\sqrt{n}} \right) $$
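The positive correlation between x̄ and s described in note 2 can be seen in a small simulation. The Python sketch below is illustrative only: the exponential population (skewed to the right), the sample size of 20, the 2000 replications, and the seed are all assumptions chosen for the demonstration, and a normal population is included for contrast.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_sd_correlation(draw, n=20, reps=2000, seed=1):
    """Correlation between sample mean and sample standard deviation
    across repeated samples of size n drawn with draw(rng)."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        means.append(sum(sample) / n)
        sds.append(statistics.stdev(sample))
    return pearson(means, sds)


skewed = mean_sd_correlation(lambda rng: rng.expovariate(1.0))
symmetric = mean_sd_correlation(lambda rng: rng.gauss(0.0, 1.0))
print("right-skewed population:", round(skewed, 2))
print("normal population:", round(symmetric, 2))
```

Under these assumptions the correlation for the skewed population should be clearly positive, while the correlation for the normal population should be near zero.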


EXERCISES 


Methods 
11. For a t distribution with 16 degrees of freedom, find the area, or probability, in each
region.
a. To the right of 2.120
b. To the left of 1.337
c. To the left of −1.746
d. To the right of 2.583
e. Between −2.120 and 2.120
f. Between −1.746 and 1.746
12. Find the t value(s) for each of the following cases.
a. Upper tail area of .025 with 12 degrees of freedom
b. Lower tail area of .05 with 50 degrees of freedom
c. Upper tail area of .01 with 30 degrees of freedom
d. Where 90% of the area falls between these two t values with 25 degrees of freedom
e. Where 95% of the area falls between these two t values with 45 degrees of freedom
13. The following sample data are from a normal population: 10, 8, 12, 15, 13, 11, 6, 5. 
a. What is the point estimate of the population mean? 
b. What is the point estimate of the population standard deviation? 
c. With 95% confidence, what is the margin of error for the estimation of the popula- 
tion mean? 
d. What is the 95% confidence interval for the population mean? 
14. A simple random sample with n = 54 provided a sample mean of 22.5 and a sample 
standard deviation of 4.4. 
a. Develop a 90% confidence interval for the population mean. 
b. Develop a 95% confidence interval for the population mean. 
c. Develop a 99% confidence interval for the population mean. 
d. What happens to the margin of error and the confidence interval as the confidence 
level is increased? 


Applications

15. Weekly Sales Reports. Sales personnel for Skillings Distributors submit weekly
reports listing the customer contacts made during the week. A sample of 65 weekly
reports showed a sample mean of 19.5 customer contacts per week. The sample standard
deviation was 5.2. Provide 90% and 95% confidence intervals for the population mean
number of weekly customer contacts for the sales personnel.


16. Years to Bond Maturity. A sample of years to maturity and yield for 40 corporate
bonds taken from Barron’s is in the file CorporateBonds.
a. What is the sample mean years to maturity for corporate bonds and what is the
sample standard deviation?
b. Develop a 95% confidence interval for the population mean years to maturity.
c. What is the sample mean yield on corporate bonds and what is the sample standard
deviation?
d. Develop a 95% confidence interval for the population mean yield on corporate bonds.

17. Quality Ratings of Airports. The International Air Transport Association surveys
business travelers to develop quality ratings for transatlantic gateway airports. The
maximum possible rating is 10. Suppose a simple random sample of 50 business
travelers is selected and each traveler is asked to provide a rating for the Miami
International Airport. The file Miami contains the ratings obtained from the sample of
50 business travelers. Develop a 95% confidence interval estimate of the population
mean rating for Miami.

18. Employment in Older Workers. Older people often have a hard time finding work.
AARP reported on the number of weeks it takes a worker aged 55 plus to find a job.
The data on number of weeks spent searching for a job contained in the file JobSearch
are consistent with the AARP findings.
a. Provide a point estimate of the population mean number of weeks it takes a worker
aged 55 plus to find a job.
b. At 95% confidence, what is the margin of error?
c. What is the 95% confidence interval estimate of the mean?
d. Discuss the degree of skewness found in the sample data. What suggestion would
you make for a repeat of this study?

19. Meal Cost in Hong Kong. The mean cost of a meal for two in a mid-range restaurant
in Tokyo is $40. How do prices for comparable meals in Hong Kong compare? The
file HongKongMeals contains the costs for a sample of 42 recent meals for two in
Hong Kong mid-range restaurants.
a. With 95% confidence, what is the margin of error?
b. What is the 95% confidence interval estimate of the population mean?
c. How do prices for meals for two in mid-range restaurants in Hong Kong compare to
prices for comparable meals in Tokyo restaurants?

20. Automobile Insurance Premiums. The average annual premium for automobile
insurance in the United States is $1503. The file AutoInsurance contains annual
premiums ($) that are representative of the website’s findings for the state of Michigan.
Assume the population of annual premiums is approximately normal.
a. Provide a point estimate of the mean annual automobile insurance premium in
Michigan.
b. Develop a 95% confidence interval for the mean annual automobile insurance
premium in Michigan.
c. Does the 95% confidence interval for the annual automobile insurance premium
in Michigan include the national average for the United States? What is your
interpretation of the relationship between auto insurance premiums in Michigan
and the national average?

21. Telemedicine. Health insurers are beginning to offer telemedicine services online
that replace the common office visit. Wellpoint provides a video service that allows
subscribers to connect with a physician online and receive prescribed treatments
(Bloomberg Businessweek). Wellpoint claims that users of its LiveHealth Online
service saved a significant amount of money on a typical visit. The data in the file
TeleHealth are consistent with the dollar savings per visit reported by Wellpoint.
Assuming the population is roughly symmetric, construct a 95% confidence interval for
the mean savings for a televisit to the doctor as opposed to an office visit.
22. Movie Ticket Sales. According to Box Office Mojo, the film Black Panther
achieved the top opening weekend gross domestic box office ticket sales in 2018
with $202,033,951. The ticket sales revenue in dollars for a sample of 30 theaters is
provided in the file BlackPanther.
a. What is the 95% confidence interval estimate for the mean ticket sales revenue per 
theater? Interpret this result. 
b. Using the movie ticket price of $9.11 per ticket, what is the estimate of the mean 
number of customers per theater? 
c. The movie was shown in 4020 theaters during its opening weekend. Estimate the 
total number of customers who saw Black Panther and the total box office ticket 
sales for the weekend. 




8.3 Determining the Sample Size 


In providing practical advice in the two preceding sections, we commented on the role of
the sample size in providing good approximate confidence intervals when the population is
not normally distributed. In this section, we focus on another aspect of the sample size
issue. We describe how to choose a sample size large enough to provide a desired margin
of error. To understand how this process works, we return to the σ known case presented in
Section 8.1. Using expression (8.1), the interval estimate is

$$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$

If a desired margin of error is selected prior to sampling, the procedures in this section
can be used to determine the sample size necessary to satisfy the margin of error
requirement.

The quantity z_{α/2}(σ/√n) is the margin of error. Thus, we see that z_{α/2}, the population
standard deviation σ, and the sample size n combine to determine the margin of error.
Once we select a confidence coefficient 1 − α, z_{α/2} can be determined. Then, if we
have a value for σ, we can determine the sample size n needed to provide any desired
margin of error. Development of the formula used to compute the required sample
size n follows.

Let E = the desired margin of error:

$$ E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}} $$

Solving for √n, we have

$$ \sqrt{n} = \frac{z_{\alpha/2} \sigma}{E} $$

Squaring both sides of this equation, we obtain the following expression for the sample
size.

SAMPLE SIZE FOR AN INTERVAL ESTIMATE OF A POPULATION MEAN

$$ n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2} \tag{8.3} $$

This sample size provides the desired margin of error at the chosen confidence level.

Equation (8.3) can be used to provide a good sample size recommendation. However,
judgment on the part of the analyst should be used to determine whether the final sample
size should be adjusted upward.

In equation (8.3), E is the margin of error that the user is willing to accept, and the value
of z_{α/2} follows directly from the confidence level to be used in developing the interval
estimate. Although user preference must be considered, 95% confidence is the most
frequently chosen value (z_{.025} = 1.96).

Finally, use of equation (8.3) requires a value for the population standard deviation σ.
However, even if σ is unknown, we can use equation (8.3) provided we have a preliminary
or planning value for σ. In practice, one of the following procedures can be chosen.

1. Use the estimate of the population standard deviation computed from data of previous
studies as the planning value for σ.
2. Use a pilot study to select a preliminary sample. The sample standard deviation
from the preliminary sample can be used as the planning value for σ.
3. Use judgment or a “best guess” for the value of σ. For example, we might begin
by estimating the largest and smallest data values in the population. The difference
between the largest and smallest values provides an estimate of the range for the
data. Finally, the range divided by 4 is often suggested as a rough approximation of
the standard deviation and thus an acceptable planning value for σ.

A planning value for the population standard deviation σ must be specified before the
sample size can be determined. Three methods of obtaining a planning value for σ are
discussed here.

Let us demonstrate the use of equation (8.3) to determine the sample size by considering 
the following example. A previous study that investigated the cost of renting automobiles 
in the United States found a mean cost of approximately $55 per day for renting a midsize 
automobile. Suppose that the organization that conducted this study would like to conduct 
a new study in order to estimate the population mean daily rental cost for a midsize auto- 
mobile in the United States. In designing the new study, the project director specifies that 
the population mean daily rental cost be estimated with a margin of error of $2 and a 95% 
level of confidence. 

The project director specified a desired margin of error of E = 2, and the 95% level of
confidence indicates z_{.025} = 1.96. Thus, we only need a planning value for the population
standard deviation σ in order to compute the required sample size. At this point, an analyst
reviewed the sample data from the previous study and found that the sample standard
deviation for the daily rental cost was $9.65. Using 9.65 as the planning value for σ,
we obtain

$$ n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2} = \frac{(1.96)^2 (9.65)^2}{2^2} = 89.43 $$

Thus, the sample size for the new study needs to be at least 89.43 midsize automobile
rentals in order to satisfy the project director’s $2 margin-of-error requirement. In cases
where the computed n is not an integer, we round up to the next integer value; hence, the
recommended sample size is 90 midsize automobile rentals.

Equation (8.3) provides the minimum sample size needed to satisfy the desired margin of
error requirement. If the computed sample size is not an integer, rounding up to the next
integer value will provide a margin of error slightly smaller than required.

NOTE + COMMENT


Equation (8.3) provides the recommended sample size n for an
infinite population as well as for a large finite population of size
N provided n/N ≤ .05. This is fine for most statistical studies.
However, if we have a finite population such that n/N > .05, a
smaller sample size can be used to obtain the desired margin
of error. The smaller sample size, denoted by n′, can be computed
using the following equation:

\[ n' = \frac{n}{1 + n/N} \]


For example, suppose that the example presented in this section
showing n = 89.43 was computed for a population of size
N = 500. With n/N = 89.43/500 = .18 > .05, a smaller sample
size can be computed by

\[ n' = \frac{n}{1 + n/N} = \frac{89.43}{1 + 89.43/500} = 75.86 \approx 76 \]

Thus, for the finite population of N = 500, the sample size
required to obtain the desired margin of error E = 2 would be
reduced from 90 to 76.
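
The sample size calculation of this section, together with the finite population adjustment described in the note, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the z value comes from the standard library's NormalDist rather than the rounded 1.96, so intermediate values may differ from the text in the later decimal places.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Return z_{alpha/2} for a given confidence coefficient."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def sample_size_mean(sigma, margin, confidence=0.95):
    """Unrounded n from equation (8.3): (z_{alpha/2})^2 * sigma^2 / E^2."""
    z = z_value(confidence)
    return (z ** 2) * (sigma ** 2) / (margin ** 2)


def finite_population_size(n, population):
    """Adjusted sample size n' = n / (1 + n/N) for a population of size N."""
    return n / (1 + n / population)


# Rental-cost example: planning value 9.65, margin of error 2, 95% confidence.
n = sample_size_mean(sigma=9.65, margin=2)
n_adjusted = finite_population_size(n, population=500)

# Round up to the next integer to meet the margin of error requirement.
print(round(n, 2), math.ceil(n))
print(round(n_adjusted, 2), math.ceil(n_adjusted))
```

Rounding up with math.ceil, rather than rounding to the nearest integer, mirrors the advice in the text: the recommended sample size should never give a margin of error larger than the one requested.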
EXERCISES


Methods 

23. How large a sample should be selected to provide a 95% confidence interval with a 
margin of error of 10? Assume that the population standard deviation is 40. 

24. The range for a set of data is estimated to be 36. 
a. What is the planning value for the population standard deviation? 
b. At 95% confidence, how large a sample would provide a margin of error of 3? 
c. At 95% confidence, how large a sample would provide a margin of error of 2? 


Applications 

25. Computer-Assisted Training. Refer to the Scheer Industries example in Section 8.2. 
Use 6.84 days as a planning value for the population standard deviation. 

a. Assuming 95% confidence, what sample size would be required to obtain a margin 
of error of 1.5 days? 

b. If the precision statement was made with 90% confidence, what sample size would 
be required to obtain a margin of error of 2 days? 

26. Gasoline Prices. The U.S. Energy Information Administration (US EIA) reported that 
the average price for a gallon of regular gasoline is $2.62. The US EIA updates its es- 
timates of average gas prices on a weekly basis. Assume the standard deviation is $.25 
for the price of a gallon of regular gasoline and recommend the appropriate sample 
size for the US EIA to use if they wish to report each of the following margins of error 
at 95% confidence. 

a. The desired margin of error is $.10. 

b. The desired margin of error is $.07. 

c. The desired margin of error is $.05. 

27. Salaries of Business Graduates. Annual starting salaries for college graduates with 
degrees in business administration are generally expected to be between $45,000 and 
$60,000. Assume that a 95% confidence interval estimate of the population mean 
annual starting salary is desired. 

a. What is the planning value for the population standard deviation? 

b. How large a sample should be taken if the desired margin of error is $500? $200? 
$100? 

c. Would you recommend trying to obtain the $100 margin of error? Explain. 

28. Beef Consumption. Many medical professionals believe that eating too much red 
meat increases the risk of heart disease and cancer. Suppose you would like to conduct 
a survey to determine the yearly consumption of beef by a typical American and want 
to use 3 pounds as the desired margin of error for a confidence interval estimate of 
the population mean amount of beef consumed annually. Use 25 pounds as a planning 
value for the population standard deviation and recommend a sample size for each of 
the following situations. 

a. A 90% confidence interval is desired for the mean amount of beef consumed. 

b. A 95% confidence interval is desired for the mean amount of beef consumed. 

c. A 99% confidence interval is desired for the mean amount of beef consumed. 

d. When the desired margin of error is set, what happens to the sample size as the con- 
fidence level is increased? Would you recommend using a 99% confidence interval 
in this case? Discuss. 

29. Length of Theater Previews. Customers arrive at a movie theater at the advertised 
movie time only to find that they have to sit through several previews and prepre- 
view ads before the movie starts. Many complain that the time devoted to previews 
is too long. A preliminary sample conducted by The Wall Street Journal showed that 
the standard deviation of the amount of time devoted to previews was 4 minutes. 
Use that as a planning value for the standard deviation in answering the following 
questions. 
a. If we want to estimate the population mean time for previews at movie theaters
with a margin of error of 75 seconds, what sample size should be used? Assume 
95% confidence. 

b. If we want to estimate the population mean time for previews at movie theaters 
with a margin of error of 1 minute, what sample size should be used? Assume 95% 
confidence. 

30. Miles Driven by Young Drivers. There has been a trend toward less driving in the last 
few years, especially by young people. Over the past eight years, the annual vehicle 
miles traveled by people from 16 to 34 years of age decreased from 10,300 to 7900 
miles per person. Assume the standard deviation is now 2000 miles. Suppose you 
would like to conduct a survey to develop a 95% confidence interval estimate of the 
annual vehicle-miles per person for people 16 to 34 years of age at the current time. 
A margin of error of 100 miles is desired. How large a sample should be used for the 
current survey? 


8.4 Population Proportion 


In the introduction to this chapter, we said that the general form of an interval estimate of a
population proportion p is


\[ \bar{p} \pm \text{Margin of error} \]


The sampling distribution of p̄ plays a key role in computing the margin of error for this
interval estimate.

In Chapter 7 we said that the sampling distribution of p̄ can be approximated by a
normal distribution whenever np ≥ 5 and n(1 − p) ≥ 5. Figure 8.10 shows the normal
approximation of the sampling distribution of p̄. The mean of the sampling distribution of
p̄ is the population proportion p, and the standard error of p̄ is


\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \qquad (8.4) \]


FIGURE 8.10 Normal Approximation of the Sampling Distribution of p̄

[Figure: normal curve for the sampling distribution of p̄, centered at p, with standard error σ_{p̄} = √(p(1 − p)/n)]
Because the sampling distribution of p̄ is normally distributed, if we choose z_{α/2}σ_{p̄}
as the margin of error in an interval estimate of a population proportion, we know that
100(1 − α)% of the intervals generated will contain the true population proportion. But
σ_{p̄} cannot be used directly in the computation of the margin of error because p will not be
known; p is what we are trying to estimate. So p̄ is substituted for p and the margin of error
for an interval estimate of a population proportion is given by


\[ \text{Margin of error} = z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \qquad (8.5) \]


With this margin of error, the general expression for an interval estimate of a population 
proportion is as follows. 


INTERVAL ESTIMATE OF A POPULATION PROPORTION


\[ \bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \qquad (8.6) \]

where 1 − α is the confidence coefficient and z_{α/2} is the z value providing an area of
α/2 in the upper tail of the standard normal distribution.


When developing
confidence intervals for
proportions, the quantity
z_{α/2}√(p̄(1 − p̄)/n) provides
the margin of error.


DATA file: TeeTimes

The following example illustrates the computation of the margin of error and interval
estimate for a population proportion. A national survey of 900 women golfers was conducted
to learn how women golfers view their treatment at golf courses in the United States. The
survey found that 396 of the women golfers were satisfied with the availability of tee times.
Thus, the point estimate of the proportion of the population of women golfers who are
satisfied with the availability of tee times is 396/900 = .44. Using expression (8.6) and a 95%
confidence level,

\[ \bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \]

\[ .44 \pm 1.96 \sqrt{\frac{.44(1 - .44)}{900}} \]

\[ .44 \pm .0324 \]


Thus, the margin of error is .0324 and the 95% confidence interval estimate of the popula- 
tion proportion is .4076 to .4724. Using percentages, the survey results enable us to state 
with 95% confidence that between 40.76% and 47.24% of all women golfers are satisfied 
with the availability of tee times. 
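
The arithmetic behind this interval can be checked with a short Python sketch. The function name is hypothetical and the z value is taken from the standard library's NormalDist (about 1.95996 rather than the rounded 1.96), so the results agree with the text to the four decimal places shown.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Return (p_bar, margin, lower, upper) following expression (8.6)."""
    p_bar = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, margin, p_bar - margin, p_bar + margin


# Women golfers survey: 396 of 900 satisfied with tee time availability.
p_bar, margin, lower, upper = proportion_interval(396, 900)
print(f"{p_bar:.2f} +/- {margin:.4f} gives ({lower:.4f}, {upper:.4f})")
```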


Excel can be used to construct an interval estimate of the population proportion of women 
golfers who are satisfied with the availability of tee times. The responses in the survey 
were recorded as a Yes or No for each woman surveyed. Refer to Figure 8.11 as we de- 
scribe the tasks involved in constructing a 95% confidence interval. The formula worksheet 
is in the background; the value worksheet appears in the foreground. 


Enter/Access Data: Open the file TeeTimes. A label and the Yes/No data for the 900 
women golfers are entered in cells A1:A901. 


Enter Functions and Formulas: The descriptive statistics we need and the response of 
interest are provided in cells D3:D6. Because Excel’s COUNT function works only with 
numerical data, we used the COUNTA function in cell D3 to compute the sample size. 
FIGURE 8.11 Excel Worksheet: 95% Confidence Interval for Survey of Women Golfers


Cell   Label                            Formula worksheet        Value worksheet
D3     Sample Size                      =COUNTA(A2:A901)         900
D4     Response of Interest             Yes                      Yes
D5     Count for Response               =COUNTIF(A2:A901,D4)     396
D6     Sample Proportion                =D5/D3                   0.44
D8     Confidence Coefficient           0.95                     0.95
D9     Level of Significance (alpha)    =1-D8                    0.05
D10    z Value                          =NORM.S.INV(1-D9/2)      1.96
D12    Standard Error                   =SQRT(D6*(1-D6)/D3)      0.0165
D13    Margin of Error                  =D10*D12                 0.0324
D15    Point Estimate                   =D6                      0.44
D16    Lower Limit                      =D15-D13                 0.4076
D17    Upper Limit                      =D15+D13                 0.4724

Note: Column A contains the Yes/No responses in cells A2:A901. Rows 19 to 899 are hidden.


The response for which we want to develop an interval estimate, Yes or No, is entered 

in cell D4. Figure 8.11 shows that Yes has been entered in cell D4, indicating that we 
want to develop an interval estimate of the population proportion of women golfers who 
are satisfied with the availability of tee times. If we had wanted to develop an interval 
estimate of the population proportion of women golfers who are not satisfied with the 
availability of tee times, we would have entered No in cell D4. With Yes entered in cell 
D4, the COUNTIF function in cell D5 counts the number of Yes responses in the sample. 
The sample proportion is then computed in cell D6 by dividing the number of Yes re- 
sponses in cell D5 by the sample size in cell D3. 

Cells D8:D10 are used to compute the appropriate z value. The confidence coefficient
(0.95) is entered in cell D8 and the level of significance (α) is computed in cell D9 by
entering the formula =1-D8. The z value corresponding to an upper tail area of α/2 is
computed by entering the formula =NORM.S.INV(1-D9/2) in cell D10. The value worksheet
shows that z_{.025} = 1.96.

Cells D12:D13 provide the estimate of the standard error and the margin of error. In cell
D12, we entered the formula =SQRT(D6*(1-D6)/D3) to compute the standard error using
the sample proportion and the sample size as inputs. The formula =D10*D12 is entered in
cell D13 to compute the margin of error.

Cells D15:D17 provide the point estimate and the lower and upper limits for a con- 
fidence interval. The point estimate in cell D15 is the sample proportion. The lower and 
upper limits in cells D16 and D17 are obtained by subtracting and adding the margin of 
error to the point estimate. We note that the 95% confidence interval for the proportion of 
women golfers who are satisfied with the availability of tee times is .4076 to .4724. 
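
For readers who want to reproduce the worksheet logic outside Excel, the following Python sketch mirrors the cells described above. It is an illustration only: the list of responses is a stand-in built from the counts reported in the text (396 Yes and 504 No), not the actual TeeTimes file, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

# Stand-in for cells A2:A901 (assumption: same counts as the TeeTimes data).
responses = ["Yes"] * 396 + ["No"] * 504


def interval_from_responses(responses, response_of_interest, confidence=0.95):
    """Mirror worksheet cells D3:D17 for a column of categorical responses."""
    sample_size = len(responses)                     # D3: COUNTA
    count = responses.count(response_of_interest)    # D5: COUNTIF
    p_bar = count / sample_size                      # D6: sample proportion
    alpha = 1 - confidence                           # D9: level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)          # D10: NORM.S.INV
    standard_error = math.sqrt(p_bar * (1 - p_bar) / sample_size)  # D12
    margin = z * standard_error                      # D13: margin of error
    return p_bar, margin, p_bar - margin, p_bar + margin


yes_result = interval_from_responses(responses, "Yes")
no_result = interval_from_responses(responses, "No")
print([round(value, 4) for value in yes_result])
print([round(value, 4) for value in no_result])
```

Changing the second argument from "Yes" to "No" plays the same role as changing the entry in cell D4: the point estimate becomes the proportion of golfers who are not satisfied, while the margin of error is unchanged because p̄(1 − p̄) is the same for .44 and .56.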


A Template for Other Problems The worksheet in Figure 8.11 can be used as a template 
for developing confidence intervals about a population proportion p. To use this worksheet 
for another problem of this type, we must first enter the new problem data in column A. 
The response of interest would then be typed in cell D4 and the ranges for the formulas in 
cells D3 and D5 would be revised to correspond to the new data. After doing so, the point 
estimate and a 95% confidence interval will be displayed in cells D15:D17. If a confidence 
interval with a different confidence coefficient is desired, we simply change the value in
cell D8. 


Determining the Sample Size 


Let us consider the question of how large the sample size should be to obtain an estimate 
of a population proportion at a specified level of precision. The rationale for the sample 
size determination in developing interval estimates of p is similar to the rationale used in 
Section 8.3 to determine the sample size for estimating a population mean. 

Previously in this section we said that the margin of error associated with an interval
estimate of a population proportion is z_{α/2}√(p̄(1 − p̄)/n). The margin of error is based on the
value of z_{α/2}, the sample proportion p̄, and the sample size n. Larger sample sizes provide a
smaller margin of error and better precision.

Let E denote the desired margin of error.


\[ E = z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \]


Solving this equation for n provides a formula for the sample size that will provide a
margin of error of size E.


\[ n = \frac{(z_{\alpha/2})^2 \bar{p}(1 - \bar{p})}{E^2} \]


Note, however, that we cannot use this formula to compute the sample size that will
provide the desired margin of error because p̄ will not be known until after we select the
sample. What we need, then, is a planning value for p̄ that can be used to make the
computation. Using p* to denote the planning value for p̄, the following formula can be used to
compute the sample size that will provide a margin of error of size E.


SAMPLE SIZE FOR AN INTERVAL ESTIMATE OF A POPULATION PROPORTION


\[ n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2} \qquad (8.7) \]


In practice, the planning value p* can be chosen by one of the following procedures. 


1. Use the sample proportion from a previous sample of the same or similar units. 

2. Use a pilot study to select a preliminary sample. The sample proportion from this 
sample can be used as the planning value, p*. 

3. Use judgment or a “best guess” for the value of p*. 

4. If none of the preceding alternatives applies, use a planning value of p* = .50. 


Let us return to the survey of women golfers and assume that the company is interested 
in conducting a new survey to estimate the current proportion of the population of women 
golfers who are satisfied with the availability of tee times. How large should the sample be 
if the survey director wants to estimate the population proportion with a margin of error of 
.025 at 95% confidence? With E = .025 and z_{α/2} = 1.96, we need a planning value p* to
answer the sample size question. Using the previous survey result of p̄ = .44 as the
planning value p*, equation (8.7) shows that


\[ n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2} = \frac{(1.96)^2 (.44)(1 - .44)}{(.025)^2} = 1514.5 \]


Thus, the sample size must be at least 1514.5 women golfers to satisfy the margin of error 
requirement. Rounding up to the next integer value indicates that a sample of 1515 women 
golfers is recommended to satisfy the margin of error requirement. 
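
Equation (8.7) translates directly into code. The sketch below is a hedged illustration rather than a prescribed method: the function name and its default arguments are assumptions, with the default planning value set to .50 to match the fourth procedure in the list above.

```python
import math
from statistics import NormalDist


def sample_size_proportion(margin, planning_value=0.50, confidence=0.95):
    """Unrounded n from equation (8.7); planning_value plays the role of p*."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z ** 2) * planning_value * (1 - planning_value) / (margin ** 2)


# Women golfers survey: E = .025, planning value from the earlier survey.
n_from_survey = sample_size_proportion(margin=0.025, planning_value=0.44)

# Fallback when no prior information is available: p* = .50.
n_conservative = sample_size_proportion(margin=0.025)

# Round up to the next integer to meet the margin of error requirement.
print(round(n_from_survey, 1), math.ceil(n_from_survey))
print(round(n_conservative, 1), math.ceil(n_conservative))
```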
TABLE 8.5 Some Possible Values for p*(1 − p*)

p*      p*(1 − p*)
.10     (.10)(.90) = .09
.30     (.30)(.70) = .21
.40     (.40)(.60) = .24
.50     (.50)(.50) = .25    ← Largest value for p*(1 − p*)
.60     (.60)(.40) = .24
.70     (.70)(.30) = .21
.90     (.90)(.10) = .09


The fourth alternative suggested for selecting a planning value p* is to use p* = .50. This 
value of p* is frequently used when no other information is available. To understand why, note 
that the numerator of equation (8.7) shows that the sample size is proportional to the quant- 
ity p*(1 — p*). A larger value for the quantity p*(1 — p*) will result in a larger sample size. 
Table 8.5 gives some possible values of p*(1 — p*). Note that the largest value of p*(1 — p*) 
occurs when p* = .50. Thus, in case of any uncertainty about an appropriate planning value, 
we know that p* = .50 will provide the largest sample size recommendation. In effect, we play 
it safe by recommending the largest necessary sample size. If the sample proportion turns out 
to be different from the .50 planning value, the margin of error will be smaller than anticip- 
ated. Thus, in using p* = .50, we guarantee that the sample size will be sufficient to obtain the 
desired margin of error. 
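
The claim that p*(1 − p*) is largest at p* = .50 is easy to confirm numerically. The short sketch below recomputes the products for the planning values listed in Table 8.5; the variable names are illustrative only.

```python
# Planning values from Table 8.5 and the product p*(1 - p*) for each.
planning_values = [0.10, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90]
products = {p: round(p * (1 - p), 2) for p in planning_values}

for p, product in products.items():
    print(f"{p:.2f}  {product:.2f}")

# The planning value with the largest product drives the largest sample size.
largest = max(products, key=products.get)
print("largest product at p* =", largest)
```

The same conclusion follows from algebra: p*(1 − p*) = p* − (p*)² is a downward-opening parabola whose vertex is at p* = .50, so no other planning value can produce a larger sample size in equation (8.7).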

In the survey of women golfers example, a planning value of p* = .50 would have
provided the sample size


\[ n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2} = \frac{(1.96)^2 (.50)(1 - .50)}{(.025)^2} = 1536.6 \]


Thus, a slightly larger sample size of 1537 women golfers would be recommended.


NOTES + COMMENTS


1. The desired margin of error for estimating a population
proportion is almost always .10 or less. In national public
opinion polls conducted by organizations such as Gallup
and Harris, a .03 or .04 margin of error is common. With
such margins of error, equation (8.7) will almost always
provide a sample size that is large enough to satisfy the
requirements of np ≥ 5 and n(1 − p) ≥ 5 for using a normal
distribution as an approximation for the sampling
distribution of p̄.

2. When developing a confidence interval for the proportion
with a sample size that is at least 5% of the population size
(that is, n/N ≥ .05), the finite population correction factor
should be used when calculating the standard error of the
sampling distribution of p̄, i.e.,

\[ s_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \]

3. Equation (8.7) provides the recommended sample size n for
an infinite population as well as for a large finite population
of size N provided n/N ≤ .05. This is fine for most statistical
studies. However, if we have a finite population such that
n/N > .05, a smaller sample size can be used to obtain the
desired margin of error. The smaller sample size denoted
by n′ can be computed using the following equation.

\[ n' = \frac{n}{1 + n/N} \]

For example, suppose that the example presented in this
section showing n = 1536.6 was computed for a population
of size N = 2500. With n/N = 1536.6/2500 = .61 > .05,
a smaller sample size can be computed by

\[ n' = \frac{n}{1 + n/N} = \frac{1536.6}{1 + 1536.6/2500} = 951.67 \approx 952 \]

Thus, for the finite population of N = 2500, the sample size
required to obtain the desired margin of error E = .025
would be reduced from 1537 to 952.
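
The finite population correction factor and the reduced sample size of 952 can be tied together numerically. The sketch below is an illustrative check, not part of the original example: the function name is hypothetical, and it assumes the planning value p* = .50 stands in for the sample proportion. It shows that a sample of 952 from a population of 2500 gives a margin of error of about .025 once the correction factor is applied.

```python
import math
from statistics import NormalDist


def standard_error_fpc(p_bar, n, population):
    """Standard error of p_bar with the finite population correction factor."""
    correction = math.sqrt((population - n) / (population - 1))
    return correction * math.sqrt(p_bar * (1 - p_bar) / n)


z = NormalDist().inv_cdf(0.975)   # z_{.025} for 95% confidence

# Reduced sample size from the note, with p* = .50 as the assumed proportion.
n_reduced, population = 952, 2500
margin_with_fpc = z * standard_error_fpc(0.50, n_reduced, population)
margin_without_fpc = z * math.sqrt(0.50 * (1 - 0.50) / n_reduced)

print(round(margin_with_fpc, 4))
print(round(margin_without_fpc, 4))
```

Without the correction factor, the same 952 observations would appear to give a margin of error of roughly .032, which is why the uncorrected formula asks for the larger sample of 1537.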


EXERCISES 


Methods 
31. A simple random sample of 400 individuals provides 100 Yes responses.
a. What is the point estimate of the proportion of the population that would provide
Yes responses?
b. What is your estimate of the standard error of the proportion, σ_{p̄}?
c. Compute the 95% confidence interval for the population proportion.

32. A simple random sample of 800 elements generates a sample proportion p̄ = .70.
a. Provide a 90% confidence interval for the population proportion.
b. Provide a 95% confidence interval for the population proportion.


33. In a survey, the planning value for the population proportion is p* = .35. How large
a sample should be taken to provide a 95% confidence interval with a margin of error
of .05?

34. At 95% confidence, how large a sample should be taken to obtain a margin of error of 
.03 for the estimation of a population proportion? Assume that past data are not avail- 
able for developing a planning value for p*. 


Applications 

35. Health-Care Survey. In the spring of 2017, the Consumer Reports National Research
Center conducted a survey of 1007 adults to learn about their major health-care
concerns. The survey results showed that 574 of the respondents lack confidence they will
be able to afford health insurance in the future.

a. What is the point estimate of the population proportion of adults who lack
confidence they will be able to afford health insurance in the future?

b. At 90% confidence, what is the margin of error?

c. Develop a 90% confidence interval for the population proportion of adults who lack
confidence they will be able to afford health insurance in the future.

d. Develop a 95% confidence interval for this population proportion.

36. Automobile Insurance Coverage. According to statistics reported on CNBC, a
surprising number of motor vehicles are not covered by insurance. Sample results, consistent
with the CNBC report, showed 46 of 200 vehicles were not covered by insurance.

a. What is the point estimate of the proportion of vehicles not covered by insurance?

b. Develop a 95% confidence interval for the population proportion.

37. Voter Sentiment. One of the questions Rasmussen Reports included on a 2018 survey
of 2,500 likely voters asked if the country is headed in the right direction.
Representative data are shown in the file RightDirection. A response of Yes indicates that the
respondent does think the country is headed in the right direction. A response of No
indicates that the respondent does not think the country is headed in the right direction.
Respondents may also give a response of Not Sure.

a. What is the point estimate of the proportion of the population of likely voters who
do think that the country is headed in the right direction?

b. At 95% confidence, what is the margin of error?

c. What is the 95% confidence interval for the proportion of likely voters who do
think that the country is headed in the right direction?

d. What is the 95% confidence interval for the proportion of likely voters who do not
think that the country is headed in the right direction?

e. Which of the confidence intervals in parts (c) and (d) has the smaller margin of
error? Why?

38. Franchise Profits. According to Franchise Business Review, over 50% of all food
franchises earn a profit of less than $50,000 a year. In the sample of 142 casual dining
restaurants contained in the file CasualDining, 81 earned a profit of less than $50,000 last year.

a. What is the point estimate of the proportion of casual dining restaurants that earned
a profit of less than $50,000 last year?

b. Determine the margin of error and provide a 95% confidence interval for the
proportion of casual dining restaurants that earned a profit of less than $50,000 
last year. 

c. How large a sample is needed if the desired margin of error is .03? 

39. Stay-at-Home Parenting. In June 2014, Pew Research reported that in 16% of all 
homes with a stay-at-home parent, the father is the stay-at-home parent. An independ- 
ent research firm has been charged with conducting a sample survey to obtain more 
current information. 

a. What sample size is needed if the research firm’s goal is to estimate the current 
proportion of homes with a stay-at-home parent in which the father is the stay-at- 
home parent with a margin of error of .03? Use a 95% confidence level. 

b. Repeat part (a) using a 99% confidence level. 

40. Employee Contributions to Health-Care Coverage. For many years businesses 
have struggled with the rising cost of health care. But recently, the increases have 
slowed due to less inflation in health care prices and employees paying for a larger 
portion of health care benefits. A recent Mercer survey showed that 52% of U.S. 
employers were likely to require higher employee contributions for health care 
coverage in the upcoming year. Suppose the survey was based on a sample of 
800 companies. Compute the margin of error and a 95% confidence interval for the 
proportion of companies likely to require higher employee contributions for health 
care coverage in the upcoming year. 

41. Driver’s License Rates. Fewer young people are driving. In 1995, 63.9% of peo- 
ple under 20 years old who were eligible had a driver’s license. Bloomberg reported 
that percentage had dropped to 41.7% in 2016. Suppose these results are based on a 
random sample of 1200 people under 20 years old who were eligible to have a driver’s 
license in 1995 and again in 2016. 

a. At 95% confidence, what is the margin of error and the interval estimate of the 
number of eligible people under 20 years old who had a driver’s license in 1995? 

b. At 95% confidence, what is the margin of error and the interval estimate of the 
number of eligible people under 20 years old who had a driver’s license in 2016? 

c. Is the margin of error the same in parts (a) and (b)? Why or why not? 

42. Voter Intent. A poll for the presidential campaign sampled 491 potential voters in 
June. A primary purpose of the poll was to obtain an estimate of the proportion of 
potential voters who favored each candidate. Assume a planning value of p* = .50 and 
a 95% confidence level. 

a. For p* = .50, what was the planned margin of error for the June poll? 

b. Closer to the November election, better precision and smaller margins of error are 
desired. Assume the following margins of error are requested for surveys to be 
conducted during the presidential campaign. Compute the recommended sample 
size for each survey. 


Survey              Margin of Error
September           .04
October             .03
Early November      .02
Pre-Election Day    .01


43. Internet Usage. The Pew Research Center Internet Project provides a variety of statis- 
tics on Internet users. For instance, in 2018, 11% (or 220) of the 2002 American adults 
surveyed were not Internet users. In 2000, 48% of the American adults in the survey 
did not use the Internet. 

a. The 2018 sample survey showed that 34% of respondents who do not use the Internet 
said that they have no interest in doing so or do not think the Internet is relevant to their 
lives. Develop a 95% confidence interval for the proportion of non-Internet users who
have no interest in using the Internet or do not think the Internet is relevant to their lives. 

b. The 2018 sample survey showed that 32% of respondents who do not use the Inter- 
net said that the Internet is too difficult to use. Develop a 95% confidence interval 
for the proportion of non-Internet users who feel the Internet is too difficult to use. 

c. The 2018 sample survey showed that 19% of respondents who do not use the In- 
ternet said that the expense of Internet service or owning a computer prevents them 
from using the Internet. Develop a 95% confidence interval for the proportion of 
non-Internet users for whom the expense of Internet service or owning a computer 
prevents them from using the Internet. 

d. The 2018 sample survey showed that 8% of respondents who do not use the 
Internet said that they are too old to learn how to use the Internet. Develop a 95% 
confidence interval for the proportion of non-Internet users who feel they are too 
old to learn how to use the Internet. 

e. Compare the margin of error for the interval estimates in parts (a), (b), (c), and (d). 
How is the margin of error related to the sample proportion? 


8.5 Practical Advice: Big Data and Interval Estimation 


We have seen that confidence intervals are powerful tools for making inferences about popu- 
lation parameters. We now consider the ramifications of big data on confidence intervals for 
means and proportions, and we return to the data-collection problem of online news service 
PenningtonDailyTimes.com (PDT). Recall that PDT’s primary source of revenue is the sale of 
advertising, so PDT’s management is concerned about the time customers spend during their 
visits to PDT’s website and whether visitors click on any of the ads featured on the website. 


Big Data and the Precision of Confidence Intervals 


A review of equations (8.2) and (8.6) shows that confidence intervals for the population
mean μ and population proportion p become narrower as the size of the sample
increases. Therefore, the potential sampling error also decreases as the sample size increases.
To illustrate the rate at which interval estimates narrow for a given confidence level, we 
consider the online news service PenningtonDailyTimes.com (PDT). 

Prospective advertisers are willing to pay a premium to advertise on websites that have 
long visit times, so the time customers spend during their visits to PDT’s website has a 
substantial impact on PDT’s advertising revenues. Suppose PDT’s management wants to 
develop a 95% confidence interval estimate of the mean amount of time customers spend 
during their visits to PDT’s website. Table 8.6 shows how the margin of error at the 95% 
confidence level decreases as the sample size increases when s = 20. 


TABLE 8.6 Margin of Error for Interval Estimates of the Population Mean
at the 95% Confidence Level for Various Sample Sizes


Sample Size n       Margin of Error t_{α/2} s_{x̄}
10                  14.30714
100                 3.96843
1,000               1.24109
10,000              .39204
100,000             .12396
1,000,000           .03920
10,000,000          .01240
100,000,000         .00392
1,000,000,000       .00124
TABLE 8.7 Margin of Error for Interval Estimates of the Population Proportion
at the 95% Confidence Level for Various Sample Sizes


Sample Size n       Margin of Error z_{α/2} σ_{p̄}
10                  .30984
100                 .09798
1,000               .03098
10,000              .00980
100,000             .00310
1,000,000           .00098
10,000,000          .00031
100,000,000         .00010
1,000,000,000       .00003


Suppose that in addition to estimating the population mean amount of time customers 
spend during their visits to PDT’s website, PDT would like to develop a 95% confidence 
interval estimate of the proportion of its website visitors that click on an ad. Table 8.7 
shows how the margin of error for a 95% confidence interval estimate of the population 
proportion decreases as the sample size increases when the sample proportion is p̄ = .51.
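
The margins of error in Table 8.7 can be regenerated with a few lines of Python. This is a sketch under the assumptions stated in the text (95% confidence and a sample proportion of .51); the function name is hypothetical.

```python
import math
from statistics import NormalDist

P_BAR = 0.51
Z = NormalDist().inv_cdf(0.975)   # z_{.025} for 95% confidence


def margin_of_error_proportion(p_bar, n, z=Z):
    """Margin of error z * sqrt(p_bar * (1 - p_bar) / n) for a proportion."""
    return z * math.sqrt(p_bar * (1 - p_bar) / n)


# Sample sizes 10, 100, ..., 1,000,000,000 as in Table 8.7.
margins = {10 ** k: margin_of_error_proportion(P_BAR, 10 ** k) for k in range(1, 10)}

for n, margin in margins.items():
    print(f"{n:>13,}  {margin:.5f}")
```

Each tenfold increase in n divides the margin of error by √10 ≈ 3.16, so a hundredfold increase in the sample size is needed to gain one additional decimal place of precision.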

The PDT example illustrates the relationship between the precision of interval estim- 
ates and the sample size. We see in Tables 8.6 and 8.7 that at a given confidence level, 
the margins of error decrease as the sample sizes increase. As a result, if the sample 
mean time spent by customers when they visit PDT’s website is 84.1 seconds, the 95% 
confidence interval estimate of the population mean time spent by customers when they 
visit PDT’s website decreases from (69.79286, 98.40714) for a sample of n = 10 to 
(83.97604, 84.22396) for a sample of n = 100,000 to (84.09876, 84.10124) for a sample 
of n = 1,000,000,000. Similarly, if the sample proportion of its website visitors who 
clicked on an ad is .51, the 95% confidence interval estimate of the population propor- 
tion of its website visitors who clicked on an ad decreases from (.20016, .81984) for a 
sample of n = 10 to (.50690, .51310) for a sample of n = 100,000 to (.50997, .51003) 
for a sample of n = 1,000,000,000. In both instances, as the sample size becomes ex- 
tremely large, the margin of error becomes extremely small and the resulting confidence 
intervals become extremely narrow. 


Implications of Big Data for Confidence Intervals 


Last year the mean time spent by all visitors to PenningtonDailyTimes.com was 84 seconds.
Suppose that PDT wants to assess whether the population mean time has changed since
last year. PDT now collects a new sample of 1,000,000 visitors to its website and calculates
the sample mean time spent by these visitors to the PDT website to be x̄ = 84.1 seconds.
The estimated population standard deviation is s = 20 seconds, so the standard error is
s_{x̄} = s/√n = .02000. Furthermore, the sample is sufficiently large to ensure that the
sampling distribution of the sample mean will be normally distributed. Thus, the 95%
confidence interval estimate of the population mean is


\[ \bar{x} \pm t_{\alpha/2} s_{\bar{x}} = 84.1 \pm .0392 = (84.06080, 84.13920) \]


What could PDT conclude from these results? There are three possible reasons that 
PDT’s sample mean of 84.1 seconds differs from last year’s population mean of 84 
seconds: (1) sampling error, (2) nonsampling error, or (3) the population mean has changed 
since last year. The 95% confidence interval estimate of the population mean does not 
include the value for the mean time spent by all visitors to the PDT website for last year 
(84 seconds), suggesting that the difference between PDT’s sample mean for the new sample 
(84.1 seconds) and the mean from last year (84 seconds) is not likely to be exclusively a 
consequence of sampling error. Nonsampling error is a possible explanation and should be 
investigated as the results of statistical inference become less reliable as nonsampling error 
is introduced into the sample data. If PDT determines that it introduced little or no non- 
sampling error into its sample data, the only remaining plausible explanation for a differ- 
ence of this magnitude is that the population mean has changed since last year. 
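
The interval in this example, and the check of whether last year's mean falls inside it, can be sketched in Python. This is a minimal illustration under a stated assumption: with 999,999 degrees of freedom the t value is effectively equal to the standard normal z value, so the sketch uses z. The function name `mean_interval_large_sample` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_interval_large_sample(x_bar, s, n, confidence=0.95):
    """Return (lower, upper) for x_bar +/- z * s / sqrt(n).

    Assumption: n is large enough that the t value is effectively
    the same as the standard normal z value.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin


if __name__ == "__main__":
    last_year_mean = 84.0
    lower, upper = mean_interval_large_sample(x_bar=84.1, s=20, n=1_000_000)
    print(f"95% interval: ({lower:.5f}, {upper:.5f})")
    print("Last year's mean inside interval:", lower <= last_year_mean <= upper)
```

The check only shows that 84 lies outside the interval; it cannot tell whether the cause is a real change in the population mean or nonsampling error.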

If PDT concludes that the sample has provided reliable evidence and the population 
mean has changed since last year, management must still consider the potential impact of 
the difference between the sample mean and the mean from last year. If a .1 second differ- 
ence in the time spent by visitors to PenningtonDailyTimes.com has a consequential effect 
on what PDT can charge for advertising on its site, this result could have practical business 
implications for PDT. Otherwise, there may be no practical significance of the .1 second 
difference in the time spent by visitors to PenningtonDailyTimes.com. 

Confidence intervals are extremely useful, but as with any other statistical tool, they 
are only effective when properly applied. Because interval estimates become increasingly 
precise as the sample size increases, extremely large samples will yield extremely precise 
estimates. However, no interval estimate, no matter how precise, will accurately reflect the 
parameter being estimated unless the sample is relatively free of nonsampling error. There- 
fore, when using interval estimation, it is always important to carefully consider whether a 
random sample of the population of interest has been taken. 
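
The warning about nonsampling error can be illustrated with a small simulation. The sketch below uses simulated data, not PDT data, and every number in it is an assumption chosen for illustration: a true mean of 84 seconds, a standard deviation of 20 seconds, and a hypothetical recording bias of 0.5 seconds added to every observation.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def simulate_biased_interval(true_mean=84.0, sigma=20.0, bias=0.5,
                             n=100_000, confidence=0.95, seed=1):
    """Draw n biased measurements and return (lower, upper) for the mean."""
    rng = random.Random(seed)
    # Each recorded time is the true visit time plus a constant recording bias.
    sample = [rng.gauss(true_mean, sigma) + bias for _ in range(n)]
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * stdev(sample) / sqrt(n)
    x_bar = fmean(sample)
    return x_bar - margin, x_bar + margin


if __name__ == "__main__":
    lower, upper = simulate_biased_interval()
    print(f"interval: ({lower:.3f}, {upper:.3f})")
    print("contains the true mean of 84:", lower <= 84.0 <= upper)
```

With a sample this large the interval is very narrow, but it is centered near 84.5 rather than 84, so its precision does not translate into accuracy.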


NOTE + COMMENT 


When taking an extremely large sample, it is conceivable that the sample size is at least 
5% of the population size; that is, n/N ≥ .05. Under these conditions it is necessary to use 
the finite population correction factor when calculating the standard error of the sampling 
distribution. 
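
As a rough sketch of this adjustment for the sample mean, the Python function below applies the usual finite population correction factor, sqrt((N − n)/(N − 1)), whenever n/N is at least .05. The function name and the population size in the example are assumptions made for illustration.

```python
from math import sqrt


def standard_error_of_mean(sigma, n, population_size=None):
    """Standard error of x_bar, with the finite population correction
    factor sqrt((N - n) / (N - 1)) applied when n / N >= .05."""
    se = sigma / sqrt(n)
    if population_size is not None and n / population_size >= 0.05:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se


if __name__ == "__main__":
    # Hypothetical numbers: sigma = 20, n = 1,000,000, N = 4,000,000.
    print(standard_error_of_mean(20, 1_000_000))             # no correction
    print(standard_error_of_mean(20, 1_000_000, 4_000_000))  # corrected
```

Here the sample is 25% of the hypothetical population, so the correction reduces the standard error noticeably.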


EXERCISES 


Applications 
44. Federal Tax Return Errors. Suppose a sample of 10,001 erroneous Federal income tax 
returns from last year has been taken and is provided in the file FedTaxErrors. A positive 
value indicates the taxpayer underpaid and a negative value indicates that the taxpayer 
overpaid. Also suppose the IRS has established that the standard deviation of errors made on 
Federal income tax returns is σ = 12,300. 
a. What is the point estimate for the mean error made on erroneous Federal income 
tax returns last year? 
b. Using 95% confidence, what is the margin of error? 
c. Using the results from part (a) and part (b), develop the 95% confidence interval 
estimate of the mean error made on erroneous Federal income tax returns last year. 
45. Federal Government Employee Sick Hours. According to the Census Bureau, 
2,475,780 people are employed by the federal government in the United States as of 
2018. Suppose that a random sample of 3500 of these federal employees was selected 
and the number of sick hours each of these employees took last year was collected 
from an electronic personnel database. The data collected in this survey are provided 
in the file FedSickHours. Based on historical data, the population standard deviation 
can be assumed to be known with σ = 34.5. 
a. What is the point estimate for the mean number of sick hours taken by federal 
employees last year? 
b. Using 99% confidence, what is the margin of error? 
c. Using the results from part (a) and part (b), develop the 99% confidence interval 
estimate of the mean number of sick hours taken by federal employees last year. 
d. If the mean sick hours federal employees took two years ago was 62.2, what would 
the confidence interval in part (c) lead you to conclude about last year? 


46. Web Browser Satisfaction. Internet users were recently asked online to rate their satis- 
faction with the web browser they use most frequently. Of 102,519 respondents, 65,120 
indicated they were very satisfied with the web browser they use most frequently. 

a. What is the point estimate for the proportion of Internet users who are very satisfied 
with the web browser they use most frequently? 

b. Using 95% confidence, what is the margin of error? 

c. Using the results from part (a) and part (b), develop the 95% confidence interval 
estimate of the proportion of Internet users who are very satisfied with the web 
browser they use most frequently. 

47. Speeding Drivers. In 2017, ABC News reports that 58% of U.S. drivers admit to 
speeding. Suppose that a new satellite technology can instantly measure the speed of 
any vehicle on a U.S. road and determine whether the vehicle is speeding, and this 
satellite technology was used to take a sample of 20,000 vehicles at 6:00 P.M. EST on a 
recent Tuesday afternoon. Of these 20,000 vehicles, 9252 were speeding. 

a. What is the point estimate for the proportion of vehicles on U.S. roads that speed? 

b. Using 99% confidence, what is the margin of error? 

c. Using the results from part (a) and part (b), develop the 99% confidence interval 
estimate of the proportion of vehicles on U.S. roads that speed. 

d. What does the confidence interval in part (c) lead you to conclude about the ABC News 
report? 


SUMMARY 


In this chapter we presented methods for developing interval estimates of a population 
mean and a population proportion. A point estimator may or may not provide a good es- 
timate of a population parameter. The use of an interval estimate provides a measure of the 
precision of an estimate. Both the interval estimate of the population mean and the population 
proportion are of the form: point estimate ± margin of error. 

We presented interval estimates for a population mean for two cases. In the σ known 
case, historical data or other information is used to develop an estimate of σ prior to taking 
a sample. Analysis of new sample data then proceeds based on the assumption that σ is 
known. In the σ unknown case, the sample data are used to estimate both the population 
mean and the population standard deviation. The final choice of which interval estimation 
procedure to use depends upon the analyst’s understanding of which method provides the 
best estimate of σ. 

In the σ known case, the interval estimation procedure is based on the assumed value 
of σ and the use of the standard normal distribution. In the σ unknown case, the interval 
estimation procedure uses the sample standard deviation s and the t distribution. In 
both cases the quality of the interval estimates obtained depends on the distribution of 
the population and the sample size. If the population is normally distributed the interval 
estimates will be exact in both cases, even for small sample sizes. If the population is 
not normally distributed, the interval estimates obtained will be approximate. Larger 
sample sizes will provide better approximations, but the more highly skewed the popu- 
lation is, the larger the sample size needs to be to obtain a good approximation. Practical 
advice about the sample size necessary to obtain good approximations was included in 
Sections 8.1 and 8.2. In most cases a sample of size 30 or more will provide good ap- 
proximate confidence intervals. 

The general form of the interval estimate for a population proportion is p̄ ± margin of error. 
In practice the sample sizes used for interval estimates of a population proportion are generally 
large. Thus, the interval estimation procedure is based on the standard normal distribution. 

Often a desired margin of error is specified prior to developing a sampling plan. We 
showed how to choose a sample size large enough to provide the desired precision. Finally, 
we discussed the ramifications of extremely large samples on the precision of confidence 
interval estimates of the mean and proportion. 
GLOSSARY 


Confidence coefficient The confidence level expressed as a decimal value. For example, 
.95 is the confidence coefficient for a 95% confidence level. 

Confidence interval Another name for an interval estimate. 

Confidence level The confidence associated with an interval estimate. For example, if 
an interval estimation procedure provides intervals such that 95% of the intervals formed 
using the procedure will include the population parameter, the interval estimate is said to 
be constructed at the 95% confidence level. 

Degrees of freedom A parameter of the t distribution. When the t distribution is used in 
the computation of an interval estimate of a population mean, the appropriate t distribution 
has n − 1 degrees of freedom, where n is the size of the sample. 

Interval estimate An estimate of a population parameter that provides an interval believed 
to contain the value of the parameter. For the interval estimates in this chapter, it has the 
form: point estimate ± margin of error. 

Level of significance The probability that the interval estimation procedure will generate 
an interval that does not contain μ. 

Margin of error The ± value added to and subtracted from a point estimate in order to 
develop an interval estimate of a population parameter. 

Practical significance The real-world impact the result of statistical inference will have 
on business decisions. 

σ known The case when historical data or other information provide a good value for the 
population standard deviation prior to taking a sample. The interval estimation procedure 
uses this known value of σ in computing the margin of error. 

σ unknown The more common case when no good basis exists for estimating the population 
standard deviation prior to taking the sample. The interval estimation procedure uses 
the sample standard deviation s in computing the margin of error. 

t distribution A family of probability distributions that can be used to develop an interval 
estimate of a population mean whenever the population standard deviation σ is unknown 
and is estimated by the sample standard deviation s. 
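
The confidence level entry in this glossary describes a long-run property of the interval estimation procedure, and a simulation can make that property concrete. The Python sketch below is an illustration under assumed values (a normal population with mean 50 and known standard deviation 10, samples of size 30); none of these numbers come from the text, and the function name `coverage_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_rate(mu=50.0, sigma=10.0, n=30, confidence=0.95,
                  trials=4000, seed=7):
    """Fraction of simulated sigma-known intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"share of intervals containing mu: {coverage_rate():.3f}")
```

The printed share should be close to the confidence coefficient of .95, although any single run will differ slightly because of simulation error.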


KEY FORMULAS 

Interval Estimate of a Population Mean: σ Known 

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \qquad (8.1)$$

Interval Estimate of a Population Mean: σ Unknown 

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \qquad (8.2)$$

Sample Size for an Interval Estimate of a Population Mean 

$$n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2} \qquad (8.3)$$

Interval Estimate of a Population Proportion 

$$\bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \qquad (8.6)$$

Sample Size for an Interval Estimate of a Population Proportion 

$$n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2} \qquad (8.7)$$
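
The two sample size calculations listed among these key formulas can be sketched in Python as follows. The function names and the example inputs are assumptions made for illustration. The sketch uses the exact standard normal quantile rather than a rounded table value such as 1.96, so results may occasionally differ by one from a hand calculation.

```python
from math import ceil
from statistics import NormalDist


def z_value(confidence):
    """Upper-tail z value z_{alpha/2} for the given confidence coefficient."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """n = (z_{alpha/2})^2 * sigma^2 / E^2, rounded up to a whole number."""
    z = z_value(confidence)
    return ceil((z * sigma / margin) ** 2)


def sample_size_for_proportion(planning_p, margin, confidence=0.95):
    """n = (z_{alpha/2})^2 * p*(1 - p*) / E^2, rounded up to a whole number."""
    z = z_value(confidence)
    return ceil(z ** 2 * planning_p * (1 - planning_p) / margin ** 2)


if __name__ == "__main__":
    # Hypothetical planning values, not taken from any exercise.
    print(sample_size_for_mean(sigma=20, margin=2))                # mean
    print(sample_size_for_proportion(planning_p=0.5, margin=0.03))  # proportion
```

Rounding up rather than to the nearest integer ensures the resulting margin of error is no larger than the one requested.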
48. Discount Brokerage Trade Fees. A sample survey of 54 discount brokers showed 
that the mean price charged for a trade of 100 shares at $50 per share was $33.77. The 
survey is conducted annually. With the historical data available, assume a known popu- 
lation standard deviation of $15. 

a. Using the sample data, what is the margin of error associated with a 95% confi- 
dence interval? 

b. Develop a 95% confidence interval for the mean price charged by discount brokers 
for a trade of 100 shares at $50 per share. 

49. Family Vacation Expenses. A survey conducted by the American Automobile As- 
sociation (AAA) showed that a family of four spends an average of $215.60 per day 
while on vacation. Suppose a sample of 64 families of four vacationing at Niagara 
Falls resulted in a sample mean of $252.45 per day and a sample standard deviation 
of $74.50. 

a. Develop a 95% confidence interval estimate of the mean amount spent per day by a 
family of four visiting Niagara Falls. 

b. Based on the confidence interval from part (a), does it appear that the population 
mean amount spent per day by families visiting Niagara Falls differs from the mean 
reported by the American Automobile Association? Explain. 

50. Annual Restaurant Expenditures. The 92 million Americans of age 50 and over con- 
trol 50 percent of all discretionary income. AARP estimated that the average annual 
expenditure on restaurants and carryout food was $1873 for individuals in this age 
group. Suppose this estimate is based on a sample of 80 persons and that the sample 
standard deviation is $550. 

a. At 95% confidence, what is the margin of error? 

b. What is the 95% confidence interval for the population mean amount spent on 
restaurants and carryout food? 

c. What is your estimate of the total amount spent by Americans of age 50 and over 
on restaurants and carryout food? 

d. If the amount spent on restaurants and carryout food is skewed to the right, would 
you expect the median amount spent to be greater or less than $1873? 

51. Healthy Sleep Duration. The Centers for Disease Control and Prevention (CDC) de- 
fines a healthy sleep duration to be at least seven hours per day. The CDC reports that 
the percentage of people who report a healthy sleep duration varies by marital status. 
The CDC also reports that in 2018, 67% of those who are married report a healthy 
sleep duration; 62% of those who have never been married report a healthy sleep dura- 
tion; and 56% of those who are divorced, widowed, or separated report a healthy sleep 
duration. The file SleepHabits contains sample data on the sleeping habits of people 
who have never been married that are consistent with the CDC’s findings. Use these 
data to answer the following questions. 

a. Develop a point estimate and a 95% confidence interval for the proportion of those 
who have never been married who report a healthy sleep duration. 

b. Develop a point estimate and a 95% confidence interval for the mean number of 
hours of sleep for those who have never been married. 

c. For those who have never been married, estimate the number of hours of sleep per 
day for those who report a healthy sleep duration. 

52. Health Care Expenditures. The Health Care Cost Institute tracks health care expenditures 
for beneficiaries under the age of 65 who are covered by employer-sponsored 
private health insurance. The data contained in the file DrugCost are consistent with the 
institute’s findings concerning annual prescription costs per employee. Analyze the data 
using Excel and answer the following questions. 

a. Develop a 90% confidence interval for the annual cost of prescription drugs. 

b. Develop a 90% confidence interval for the amount of out-of-pocket expense per 
employee. 


c. What is your point estimate of the proportion of employees who incurred no pre- 
scription drug costs? 

d. Which, if either, of the confidence intervals in parts (a) and (b) has a larger margin 
of error? Why? 

53. Obesity. Obesity is a risk factor for many health problems such as type 2 diabetes, high 
blood pressure, joint problems, and gallstones. Using data collected in 2018 through the 
National Health and Nutrition Examination Survey, the National Institute of Diabetes 
and Digestive and Kidney Diseases estimates that 37.7% of all adults in the United 
States have a body mass index (BMI) in excess of 30 and so are categorized as obese. 
The data in the file Obesity are consistent with these findings. 

a. Use the Obesity data set to develop a point estimate of the BMI for adults in the 
United States. Are adults in the United States obese on average? 

b. What is the sample standard deviation? 

c. Develop a 95% confidence interval for the BMI of adults in the United States. 

54. Automobile Mileage Tests. Mileage tests are conducted for a particular model of 
automobile. If a 98% confidence interval with a margin of error of 1 mile per gallon is 
desired, how many automobiles should be used in the test? Assume that preliminary 
mileage tests indicate the standard deviation is 2.6 miles per gallon. 

55. Patient Treatment Times. In developing patient appointment schedules, a medical 
center wants to estimate the mean time that a staff member spends with each patient. 
How large a sample should be taken if the desired margin of error is two minutes at a 
95% level of confidence? How large a sample should be taken for a 99% level of con- 
fidence? Use a planning value for the population standard deviation of eight minutes. 

56. CEO Compensation. Annual salary plus bonus data for chief executive officers are 
presented in an annual pay survey. A preliminary sample showed that the standard 
deviation is $675 with data provided in thousands of dollars. How many chief execu- 
tive officers should be in a sample if we want to estimate the population mean annual 
salary plus bonus with a margin of error of $100,000? (Note: The desired margin of 
error would be E = 100 if the data are in thousands of dollars.) Use 95% confidence. 

57. Paying for College Tuition. The National Center for Education Statistics reported that 
47% of college students work to pay for tuition and living expenses. Assume that a 
sample of 450 college students was used in the study. 

a. Provide a 95% confidence interval for the population proportion of college students 
who work to pay for tuition and living expenses. 

b. Provide a 99% confidence interval for the population proportion of college students 
who work to pay for tuition and living expenses. 

c. What happens to the margin of error as the confidence is increased from 95% 
to 99%? 

58. Parenting Time. A USA Today/CNN/Gallup survey of 369 working parents found 200 
who said they spend too little time with their children because of work commitments. 
a. What is the point estimate of the proportion of the population of working parents who 
feel they spend too little time with their children because of work commitments? 

b. At 95% confidence, what is the margin of error? 

c. What is the 95% confidence interval estimate of the population proportion of work- 
ing parents who feel they spend too little time with their children because of work 
commitments? 

59. Social Media Usage. The Pew Research Center has conducted extensive research 
on social media usage. One finding, reported in June 2018, was that 78% of adults 
aged 18 to 24 use Snapchat. Another finding was that 45% of those aged 18 to 
24 use Twitter. Assume the sample size associated with both findings is 500. 

a. Develop a 95% confidence interval for the proportion of adults aged 18 to 24 who 
use Snapchat. 

b. Develop a 99% confidence interval for the proportion of adults aged 18 to 24 who 
use Twitter. 

c. In which case, part (a) or part (b), is the margin of error larger? Explain why. 


60. Importance of Economy to Voters. A survey of 750 likely voters in Ohio was con- 
ducted by the Rasmussen Poll just prior to the general election. The state of the econ- 
omy was thought to be an important determinant of how people would vote. Among 
other things, the survey found that 165 of the respondents rated the economy as good 
or excellent and 315 rated the economy as poor. 

a. Develop a point estimate of the proportion of likely voters in Ohio who rated the 
economy as good or excellent. 

b. Construct a 95% confidence interval for the proportion of likely voters in Ohio who 
rated the economy as good or excellent. 

c. Construct a 95% confidence interval for the proportion of likely voters in Ohio who 
rated the economy as poor. 

d. Which of the confidence intervals in parts (b) and (c) is wider? Why? 

61. Smoking. In 2014, the Centers for Disease Control and Prevention reported the per- 
centage of people 18 years of age and older who smoke. Suppose that a study designed 
to collect new data on smokers and nonsmokers uses a preliminary estimate of the 
proportion who smoke of .30. 

a. How large a sample should be taken to estimate the proportion of smokers in the 
population with a margin of error of .02? Use 95% confidence. 

b. Assume that the study uses your sample size recommendation in part (a) and 
finds 520 smokers. What is the point estimate of the proportion of smokers in the 
population? 

c. What is the 95% confidence interval for the proportion of smokers in the population? 

62. Credit Card Balances. A well-known bank credit card firm wishes to estimate the 
proportion of credit card holders who carry a nonzero balance at the end of the month and 
incur an interest charge. Assume that the desired margin of error is .03 at 98% confidence. 
a. How large a sample should be selected if it is anticipated that roughly 70% of the 
firm’s card holders carry a nonzero balance at the end of the month? 

b. How large a sample should be selected if no planning value for the proportion could 
be specified? 

63. Credit Card Ownership. Credit card ownership varies across age groups. In 2018, 
CreditCards.com estimated that the percentage of people who own at least one 
credit card is 67% in the 18–24 age group, 83% in the 25–34 age group, 76% in the 
35–49 age group, and 78% in the 50+ age group. Suppose these estimates are based 
on 455 randomly selected people from each age group. 

a. Construct a 95% confidence interval for the proportion of people in each of these 
age groups who owns at least one credit card. 

b. Assuming the same sample size will be used in each age group, how large would 
the sample need to be to ensure that the margin of error is .03 or less for each of the 
four confidence intervals? 

64. Factors in Choosing an Airline. Although airline schedules and cost are important 
factors for business travelers when choosing an airline carrier, a USA Today survey 
found that business travelers list an airline’s frequent flyer program as the most impor- 
tant factor. From a sample of n = 1993 business travelers who responded to the survey, 
618 listed a frequent flyer program as the most important factor. 

a. What is the point estimate of the proportion of the population of business travelers 
who believe a frequent flyer program is the most important factor when choosing 
an airline carrier? 

b. Develop a 95% confidence interval estimate of the population proportion. 

c. How large a sample would be required to report the margin of error of .01 at 95% 
confidence? Would you recommend that USA Today attempt to provide this degree 
of precision? Why or why not? 

65. Driving Speeds. Huston Systems Private Limited reports that smart traffic signals and 
signs can measure a passing vehicle’s speed. Consider the speeds of 15,717 vehicles 
collected as they passed 35 MPH speed limit signs throughout the United States in 2018 
that are provided in the file 35MPH. 

a. What is the sample mean speed of U.S. drivers in a 35-mph zone? 

b. Using 95% confidence, what is the margin of error? 

c. Using the results from parts (a) and (b), develop the 95% confidence interval estim- 
ate of the mean speed of U.S. drivers in a 35-mph zone. 

66. Underemployment. A recent survey from PayScale found that 46% of U.S. workers 
(roughly 22 million) are underemployed, either working part-time or at jobs that don’t 
allow them to use their education or skills. Suppose that the numbers of hours worked in 
the past week were collected from a random sample of 28,585 of these workers. The data 
collected in this survey are provided in the file UnderEmployed. 

a. What is the sample mean number of hours worked by underemployed U.S. workers? 

b. Using 99% confidence, what is the margin of error? 

c. Using the results from parts (a) and (b), develop the 99% confidence interval estim- 
ate of the mean number of hours worked by underemployed U.S. workers. 

d. If the mean hours worked by underemployed U.S. workers during the same week 
one year ago was 35.6, what would the confidence interval in part (c) lead you to 
conclude about last week? 

67. FTC Fraud Reports. In 2017, 42.54% of the nearly 2.7 million reports taken nationwide 
by the Federal Trade Commission’s Consumer Sentinel Network dealt with instances of 
fraud. Consider results of a random sample of 42,296 of the reports taken by the Consumer 
Sentinel Network from Florida that are provided in the file FloridaFraud. 

a. What is the sample proportion of reports filed from Florida that dealt with instances 
of fraud? 

b. Using 95% confidence, what is the margin of error? 

c. Using the results from parts (a) and (b), develop the 95% confidence interval estimate 
of the proportion of reports filed from Florida that dealt with instances of fraud. 
What do you conclude about Florida from these results? 

68. Structurally Deficient Bridges. The Infrastructure Report Card (IRC) reports that 
of 614,387 U.S. bridges, 9.1% were structurally deficient as of last year. The IRC 
also reports that more than 1300 California bridges fall under this category. How 
does California compare to the nation? Consider a random sample of 8749 bridges in 
California that includes 490 structurally deficient bridges. 

a. What is the sample proportion of structurally deficient bridges in California? 

b. Using 90% confidence, what is the margin of error? 

c. Using the results from parts (a) and (b), develop the 90% confidence interval esti- 
mate of the proportion of structurally deficient bridges in California. 

d. What does the confidence interval in part (c) lead you to conclude about California 
bridges? 




CASE PROBLEM 1: YOUNG PROFESSIONAL 
MAGAZINE 


Young Professional magazine was developed for a target audience of recent college grad- 
uates who are in their first 10 years in a business/professional career. In its two years of 
publication, the magazine has been fairly successful. Now the publisher is interested in ex- 
panding the magazine’s advertising base. Potential advertisers continually ask about the de- 
mographics and interests of subscribers to Young Professional. To collect this information, 
the magazine commissioned a survey to develop a profile of its subscribers. The survey 
results will be used to help the magazine choose articles of interest and provide advertisers 
with a profile of subscribers. As a new employee of the magazine, you have been asked to 
help analyze the survey results. 

Some of the survey questions follow: 


1. What is your age? 
2. Are you: Male Female 
3. Do you plan to make any real estate purchases in the next two years? Yes No 
4. What is the approximate total value of financial investments, exclusive of your home, 
owned by you or members of your household? 
5. How many stock/bond/mutual fund transactions have you made in the past year? 
6. Do you have broadband access to the Internet at home? Yes No 
7. Please indicate your total household income last year. 
8. Do you have children? Yes No 

The file Professional contains the responses to these questions. Table 8.8 shows the 
portion of the file pertaining to the first five survey respondents. 


TABLE 8.8 Partial Survey Results for Young Professional Magazine 

                 Real Estate   Value of          Number of      Internet   Household
Age   Gender     Purchases     Investments ($)   Transactions   Access     Income ($)   Children
38    Female     No            12,200            4              Yes        75,200       Yes
30    Male       No            12,400            4              Yes        70,300       Yes
41    Female     No            26,800            5              Yes        48,200       No
28    Female     Yes           19,600            6              No         95,300       No
31    Female     Yes           15,100            5              No         73,300       Yes


Managerial Report 

Prepare a managerial report summarizing the results of the survey. In addition to statistical 
summaries, discuss how the magazine might use these results to attract advertisers. You 
might also comment on how the survey results could be used by the magazine’s editors to 
identify topics that would be of interest to readers. Your report should address the follow- 
ing issues, but do not limit your analysis to just these areas. 


1. Develop appropriate descriptive statistics to summarize the data. 

2. Develop 95% confidence intervals for the mean age and household income of 
subscribers. 

3. Develop 95% confidence intervals for the proportion of subscribers who have Internet 
access at home and the proportion of subscribers who have children. 

4. Would Young Professional be a good advertising outlet for online brokers? Justify your 
conclusion with statistical data. 

5. Would this magazine be a good place to advertise for companies selling educational 
software and computer games for young children? 

6. Comment on the types of articles you believe would be of interest to readers of Young 
Professional. 


CASE PROBLEM 2: GULF REAL ESTATE 
PROPERTIES 




Gulf Real Estate Properties, Inc. is a real estate firm located in southwest Florida. The 
company, which advertises itself as “expert in the real estate market,” monitors condominium 
sales by collecting data on location, list price, sale price, and number of days it takes 
to sell each unit. Each condominium is classified as Gulf View if it is located directly on the 
Gulf of Mexico or No Gulf View if it is located on the bay or a golf course, near but not on 
the Gulf. Sample data from the Multiple Listing Service in Naples, Florida, provided sales 
data for 40 Gulf View condominiums and 18 No Gulf View condominiums.* Prices are in 
thousands of dollars. The data are shown in Table 8.9. 


TABLE 8.9 Sales Data for Gulf Real Estate Properties 
DATA file: GulfProp 

Gulf View Condominiums                      No Gulf View Condominiums
List Price    Sale Price    Days to Sell    List Price    Sale Price    Days to Sell
495.0         475.0         130             217.0         217.0         182
379.0         350.0         [illegible]     148.0         [illegible]   338
529.0         519.0         85              186.5         179.0         122
552.5         524.5         95              239.0         230.0         150
334.9         334.9         119             279.0         207.5         169
550.0         505.0         92              215.0         214.0         58
169.9         165.0         197             279.0         259.0         110
210.0         210.0         56              [illegible]   [illegible]   130
975.0         945.0         73              149.9         144.9         149
314.0         314.0         126             235.0         230.0         114
315.0         305.0         88              199.8         192.0         120
885.0         800.0         282             210.0         195.0         61
975.0         975.0         100             226.0         212.0         146
469.0         445.0         56              149.9         146.5         17
329.0         305.0         49              160.0         160.0         281
365.0         330.0         48              322.0         292.5         63
332.0         312.0         88              187.5         179.0         48
520.0         495.0         161             247.0         227.0         52
425.0         405.0         149
675.0         669.0         142
409.0         400.0         28
649.0         649.0         29
[illegible]   305.0         140
425.0         410.0         85
359.0         340.0         107
469.0         449.0         72
895.0         875.0         [illegible]
439.0         430.0         160
435.0         400.0         206
235.0         227.0         [illegible]
638.0         618.0         100
629.0         600.0         [illegible]
329.0         309.0         114
595.0         555.0         45
339.0         315.0         150
215.0         200.0         48
395.0         375.0         185
449.0         425.0         53
499.0         465.0         86
439.0         428.5         158
Managerial Report 


1. Use appropriate descriptive statistics to summarize each of the three variables for the 
40 Gulf View condominiums. 


*Data based on condominium sales reported in the Naples MLS (Coldwell Banker, June 2000). 




2. Use appropriate descriptive statistics to summarize each of the three variables for the 
18 No Gulf View condominiums. 

3. Compare your summary results. Discuss any specific statistical results that would help a 
real estate agent understand the condominium market. 

4. Develop a 95% confidence interval estimate of the population mean sales price and 
population mean number of days to sell for Gulf View condominiums. Interpret 
your results. 

5. Develop a 95% confidence interval estimate of the population mean sales price and 
population mean number of days to sell for No Gulf View condominiums. Interpret 
your results. 

6. Assume the branch manager requested estimates of the mean selling price of Gulf View 
condominiums with a margin of error of $40,000 and the mean selling price of No Gulf 
View condominiums with a margin of error of $15,000. Using 95% confidence, how 
large should the sample sizes be? 

7. Gulf Real Estate Properties just signed contracts for two new listings: a Gulf View 
condominium with a list price of $589,000 and a No Gulf View condominium with a list 
price of $285,000. What is your estimate of the final selling price and number of days 
required to sell each of these units? 


CASE PROBLEM 3: METROPOLITAN RESEARCH, INC.


Metropolitan Research, Inc., a consumer research organization, conducts surveys de- 
signed to evaluate a wide variety of products and services available to consumers. In one 
particular study, Metropolitan looked at consumer satisfaction with the performance of 
automobiles produced by a major Detroit manufacturer. A questionnaire sent to owners of 
one of the manufacturer’s full-sized cars revealed several complaints about early transmis- 
sion problems. To learn more about the transmission failures, Metropolitan used a sample 
of actual transmission repairs provided by a transmission repair firm in the Detroit area. 
The following data show the actual number of miles driven for 50 vehicles at the time of 
transmission failure. 


DATA file: Auto

85,092 32,609 59,465 77,437 32,534 64,090 32,464 59,902
39,323 89,641 94,219 116,803 92,857 63,436 65,605 85,861
64,342 61,978 67,998 59,817 101,769 95,774 121,352 69,568
74,276 66,998 40,001 72,069 25,066 77,098 69,922 35,662
74,425 67,202 118,444 53,500 79,294 64,544 86,813 116,269
37,831 89,341 73,341 85,288 138,114 53,402 85,586 82,256
77,539 88,798


Managerial Report 


1. Use appropriate descriptive statistics to summarize the transmission failure data. 

2. Develop a 95% confidence interval for the mean number of miles driven until trans- 
mission failure for the population of automobiles with transmission failure. Provide a 
managerial interpretation of the interval estimate. 

3. Discuss the implication of your statistical findings in terms of the belief that some own- 
ers of the automobiles experienced early transmission failures. 

4. How many repair records should be sampled if the research firm wants the population 
mean number of miles driven until transmission failure to be estimated with a margin of 
error of 5000 miles? Use 95% confidence. 

5. What other information would you like to gather to evaluate the transmission failure 
problem more fully? 
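
The computations behind items 1, 2, and 4 of this report can be sketched in Python. This is only an illustrative outline under stated assumptions, not the text's Excel procedure: the mileage values are transcribed from the listing above, the t value 2.010 (49 degrees of freedom, upper-tail area .025) is read from a t table, the sample standard deviation serves as the planning value for σ in item 4, and names such as `miles` and `margin_of_error` are hypothetical.

```python
from math import ceil, sqrt
from statistics import mean, median, stdev

# Miles driven at the time of transmission failure (50 vehicles, as listed above)
miles = [
    85092, 32609, 59465, 77437, 32534, 64090, 32464, 59902,
    39323, 89641, 94219, 116803, 92857, 63436, 65605, 85861,
    64342, 61978, 67998, 59817, 101769, 95774, 121352, 69568,
    74276, 66998, 40001, 72069, 25066, 77098, 69922, 35662,
    74425, 67202, 118444, 53500, 79294, 64544, 86813, 116269,
    37831, 89341, 73341, 85288, 138114, 53402, 85586, 82256,
    77539, 88798,
]

n = len(miles)
x_bar = mean(miles)
s = stdev(miles)  # sample standard deviation (n - 1 divisor)

# Item 1: descriptive statistics
print(f"n = {n}, mean = {x_bar:,.0f}, median = {median(miles):,.0f}, s = {s:,.0f}")
print(f"min = {min(miles):,}, max = {max(miles):,}")

# Item 2: 95% confidence interval, sigma unknown: x_bar +/- t * s / sqrt(n)
T_025_49 = 2.010  # assumption: t table value for 49 degrees of freedom
margin_of_error = T_025_49 * s / sqrt(n)
lower, upper = x_bar - margin_of_error, x_bar + margin_of_error
print(f"95% CI: {lower:,.0f} to {upper:,.0f} miles")

# Item 4: sample size for a desired margin of error E: n = (z * sigma / E)^2
Z_025 = 1.96
E = 5000
n_needed = ceil((Z_025 * s / E) ** 2)  # s used as the planning value for sigma
print(f"Sample size for a margin of error of {E:,} miles: {n_needed}")
```

Rounding the sample size up, rather than to the nearest integer, keeps the margin of error at or below the requested value.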
STATISTICS IN PRACTICE


John Morrell & Company* 
CINCINNATI, OHIO 


John Morrell & Company, which began in England in 
1827, is considered the oldest continuously operating 
meat manufacturer in the United States. It is a wholly 
owned and independently managed subsidiary of 
Smithfield Foods, Smithfield, Virginia. John Morrell & 
Company offers an extensive product line of processed 
meats and fresh pork to consumers under 13 regional 
brands, including John Morrell, E-Z-Cut, Tobin's First 
Prize, Dinner Bell, Hunter, Kretschmar, Rath, Rodeo, 
Shenson, Farmers Hickory Brand, Iowa Quality, and
Peyton's. Each regional brand enjoys high brand recog- 
nition and loyalty among consumers. 

Market research at Morrell provides management 
with up-to-date information on the company’s various 
products and how the products compare with compet- 
ing brands of similar products. A recent study compared 
a Beef Pot Roast made by Morrell to similar beef prod- 
ucts from two major competitors. In the three-product 
comparison test, a sample of consumers was used 
to indicate how the products rated in terms of taste, 
appearance, aroma, and overall preference. 

One research question concerned whether the Beef 
Pot Roast made by Morrell was the preferred choice of 
more than 50% of the consumer population. Letting p 
indicate the population proportion preferring Morrell’s 
product, the hypothesis test for the research question is 
as follows: 


$H_0: p \le .50$
$H_a: p > .50$


The null hypothesis $H_0$ indicates that the preference
for Morrell’s product is less than or equal to 50%. If
the sample data support rejecting $H_0$ in favor of the
alternative hypothesis $H_a$, Morrell will draw the research
conclusion that in a three-product comparison, its
Beef Pot Roast is preferred by more than 50% of the
consumer population.

*The authors are indebted to Marty Butler, Vice President of Marketing,
John Morrell, for providing this Statistics in Practice.

Photo: Hypothesis testing helps John Morrell & Company
analyze market research about its products.
RosalreneBetancourt 12/Alamy Stock Photo

In an independent taste test study using a sample
of 224 consumers in Cincinnati, Milwaukee, and Los
Angeles, 150 consumers selected the Beef Pot Roast
made by Morrell as the preferred product. Using statistical
hypothesis testing procedures, the null hypothesis
$H_0$ was rejected. The study provided statistical
evidence supporting $H_a$ and the conclusion that the
Morrell product is preferred by more than 50% of the
consumer population.

The point estimate of the population proportion
was $\bar{p} = 150/224 = .67$. Thus, the sample data
provided support for a food magazine advertisement 
showing that in a three-product taste comparison, Beef 
Pot Roast made by Morrell was “preferred 2 to 1 over 
the competition.” 
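
The Morrell result can be checked with a short calculation. The sketch below is an assumption-laden illustration rather than the study's actual procedure: it uses the usual large-sample z statistic for a proportion, a level of significance of .05 (the text does not state the level used), and a hypothetical function name `upper_tail_proportion_test`.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_proportion_test(successes, n, p0, alpha=0.05):
    """Upper tail test of H0: p <= p0 versus Ha: p > p0 (large-sample z test)."""
    p_bar = successes / n                    # point estimate of p
    std_error = sqrt(p0 * (1 - p0) / n)      # standard error when p = p0
    z = (p_bar - p0) / std_error             # test statistic
    p_value = 1 - NormalDist().cdf(z)        # area in the upper tail
    return p_bar, z, p_value, p_value <= alpha


# 150 of 224 consumers preferred Morrell's Beef Pot Roast
p_bar, z, p_value, reject = upper_tail_proportion_test(150, 224, 0.50)
print(f"p_bar = {p_bar:.2f}, z = {z:.2f}, p-value = {p_value:.2g}, reject H0: {reject}")
```

With a test statistic this far into the upper tail, the conclusion to reject the null hypothesis does not depend on whether .05 or .01 is chosen as the level of significance.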

In this chapter we will discuss how to formulate 
hypotheses and how to conduct tests like the one used 
by Morrell. Through the analysis of sample data, we will 
be able to determine whether a hypothesis should or 
should not be rejected. 


In Chapters 7 and 8 we showed how a sample could be used to develop point and interval 
estimates of population parameters. In this chapter we continue the discussion of statistical 
inference by showing how hypothesis testing can be used to determine whether a statement 
about the value of a population parameter should or should not be rejected. 

In hypothesis testing we begin by making a tentative assumption about a population
parameter. This tentative assumption is called the null hypothesis and is denoted by $H_0$.
We then define another hypothesis, called the alternative hypothesis, which is the opposite
of what is stated in the null hypothesis. The alternative hypothesis is denoted by $H_a$.
The hypothesis testing procedure uses data from a sample to test the two competing statements
indicated by $H_0$ and $H_a$.

This chapter shows how hypothesis tests can be conducted about a population mean
and a population proportion. We begin by providing examples that illustrate approaches to
developing null and alternative hypotheses.


9.1 Developing Null and Alternative Hypotheses


Learning to formulate hypotheses correctly will take some practice. Expect some initial
confusion over the proper choice of the null and alternative hypotheses. The examples
in this section are intended to provide guidelines.

The conclusion that the research hypothesis is true is made if the sample data provide
sufficient evidence to show that the null hypothesis can be rejected.


It is not always obvious how the null and alternative hypotheses should be formulated. 
Care must be taken to structure the hypotheses appropriately so that the hypothesis testing 
conclusion provides the information the researcher or decision maker wants. The context 
of the situation is very important in determining how the hypotheses should be stated. All 
hypothesis testing applications involve collecting a sample and using the sample results to 
provide evidence for drawing a conclusion. Good questions to consider when formulating 
the null and alternative hypotheses are, What is the purpose of collecting the sample? What 
conclusions are we hoping to make? 

In the chapter introduction, we stated that the null hypothesis $H_0$ is a tentative assumption
about a population parameter such as a population mean or a population proportion.
The alternative hypothesis $H_a$ is a statement that is the opposite of what is stated in the null
hypothesis. In some situations it is easier to identify the alternative hypothesis first and then
develop the null hypothesis. In other situations it is easier to identify the null hypothesis
first and then develop the alternative hypothesis. We will illustrate these situations in the
following examples.


The Alternative Hypothesis as a Research Hypothesis 


Many applications of hypothesis testing involve an attempt to gather evidence in support 
of a research hypothesis. In these situations, it is often best to begin with the alternative 
hypothesis and make it the conclusion that the researcher hopes to support. Consider a 
particular automobile that currently attains a fuel efficiency of 24 miles per gallon in city 
driving. A product research group has developed a new fuel injection system designed to 
increase the miles-per-gallon rating. The group will run controlled tests with the new fuel 
injection system looking for statistical support for the conclusion that the new fuel injec- 
tion system provides more miles per gallon than the current system. 

Several new fuel injection units will be manufactured, installed in test automobiles, and
subjected to research-controlled driving conditions. The sample mean miles per gallon for
these automobiles will be computed and used in a hypothesis test to determine whether
it can be concluded that the new system provides more than 24 miles per gallon. In terms
of the population mean miles per gallon $\mu$, the research hypothesis $\mu > 24$ becomes the
alternative hypothesis. Since the current system provides an average or mean of 24 miles
per gallon, we will make the tentative assumption that the new system is not any better
than the current system and choose $\mu \le 24$ as the null hypothesis. The null and alternative
hypotheses are:


$H_0: \mu \le 24$
$H_a: \mu > 24$


If the sample results lead to the conclusion to reject $H_0$, the inference can be made that
$H_a: \mu > 24$ is true. The researchers have the statistical support to state that the new fuel
injection system increases the mean number of miles per gallon. The production of automobiles
with the new fuel injection system should be considered. However, if the sample
results lead to the conclusion that $H_0$ cannot be rejected, the researchers cannot conclude
that the new fuel injection system is better than the current system. Production of automobiles
with the new fuel injection system on the basis of better gas mileage cannot be
justified. Perhaps more research and further testing can be conducted.

Successful companies stay competitive by developing new products, new methods,
new systems, and the like that are better than what is currently available. Before
adopting something new, it is desirable to conduct research to determine whether there
is statistical support for the conclusion that the new approach is indeed better. In such
cases, the research hypothesis is stated as the alternative hypothesis. For example, a new
teaching method is developed that is believed to be better than the current method. The
alternative hypothesis is that the new method is better. The null hypothesis is that the
new method is no better than the old method. A new sales force bonus plan is developed
in an attempt to increase sales. The alternative hypothesis is that the new bonus plan
increases sales. The null hypothesis is that the new bonus plan does not increase sales.
A new drug is developed with the goal of lowering blood pressure more than an existing
drug. The alternative hypothesis is that the new drug lowers blood pressure more
than the existing drug. The null hypothesis is that the new drug does not provide lower
blood pressure than the existing drug. In each case, rejection of the null hypothesis $H_0$
provides statistical support for the research hypothesis. We will see many examples of
hypothesis tests in research situations such as these throughout this chapter and in the
remainder of the text.


The Null Hypothesis as an Assumption to Be Challenged 


Of course, not all hypothesis tests involve research hypotheses. In the following discussion
we consider applications of hypothesis testing where we begin with a belief or an assumption
that a statement about the value of a population parameter is true. We will then use
a hypothesis test to challenge the assumption and determine whether there is statistical
evidence to conclude that the assumption is incorrect. In these situations, it is helpful to
develop the null hypothesis first. The null hypothesis $H_0$ expresses the belief or assumption
about the value of the population parameter. The alternative hypothesis $H_a$ is that the belief
or assumption is incorrect.

As an example, consider the situation of a manufacturer of soft drink products. The
label on a soft drink bottle states that it contains 67.6 fluid ounces. We consider the label
correct provided the population mean filling volume for the bottles is at least 67.6 fluid
ounces. Without any reason to believe otherwise, we would give the manufacturer the benefit
of the doubt and assume that the statement provided on the label is correct. Thus, in a
hypothesis test about the population mean fluid ounces per bottle, we would begin with the
assumption that the label is correct and state the null hypothesis as $\mu \ge 67.6$. The challenge
to this assumption would imply that the label is incorrect and the bottles are being underfilled.
This challenge would be stated as the alternative hypothesis $\mu < 67.6$. Thus, the null
and alternative hypotheses are:


$H_0: \mu \ge 67.6$
$H_a: \mu < 67.6$


A government agency with the responsibility for validating manufacturing labels could
select a sample of soft drink bottles, compute the sample mean fluid ounces, and use the
sample results to test the preceding hypotheses. If the sample results lead to the conclusion
to reject $H_0$, the inference that $H_a: \mu < 67.6$ is true can be made. With this statistical support,
the agency is justified in concluding that the label is incorrect and underfilling of the
bottles is occurring. Appropriate action to force the manufacturer to comply with labeling
standards would be considered. However, if the sample results indicate $H_0$ cannot be
rejected, the assumption that the manufacturer’s labeling is correct cannot be rejected. With
this conclusion, no action would be taken.

A manufacturer's product information is usually assumed to be true and stated as the null
hypothesis. The conclusion that the information is incorrect can be made if the
null hypothesis is rejected.

Let us now consider a variation of the soft drink bottle-filling example by viewing the
same situation from the manufacturer’s point of view. The bottle-filling operation has been
designed to fill soft drink bottles with 67.6 fluid ounces as stated on the label. The company
does not want to underfill the containers because that could result in an underfilling
complaint from customers or, perhaps, a government agency. However, the company does
not want to overfill containers either because putting more soft drink than necessary into
the containers would be an unnecessary cost. The company’s goal would be to adjust the
bottle-filling operation so that the population mean filling weight per bottle is 67.6 fluid
ounces as specified on the label.

Although this is the company’s goal, from time to time any production process can
get out of adjustment. If this occurs in our example, underfilling or overfilling of the soft
drink bottles will occur. In either case, the company would like to know about it in order
to correct the situation by readjusting the bottle-filling operation to the designed 67.6 fluid
ounces. In this hypothesis testing application, we would begin with the assumption that the
production process is operating correctly and state the null hypothesis as $\mu = 67.6$ fluid
ounces. The alternative hypothesis that challenges this assumption is that $\mu \ne 67.6$, which
indicates either overfilling or underfilling is occurring. The null and alternative hypotheses
for the manufacturer’s hypothesis test are


$H_0: \mu = 67.6$
$H_a: \mu \ne 67.6$


Suppose that the soft drink manufacturer uses a quality control procedure to periodically
select a sample of bottles from the filling operation and computes the sample mean fluid
ounces per bottle. If the sample results lead to the conclusion to reject $H_0$, the inference is
made that $H_a: \mu \ne 67.6$ is true. We conclude that the bottles are not being filled properly
and the production process should be adjusted to restore the population mean to 67.6 fluid
ounces per bottle. However, if the sample results indicate $H_0$ cannot be rejected, the assumption
that the manufacturer’s bottle-filling operation is functioning properly cannot be
rejected. In this case, no further action would be taken and the production operation would
continue to run.

The two preceding forms of the soft drink manufacturing hypothesis test show that the 
null and alternative hypotheses may vary depending upon the point of view of the researcher 
or decision maker. To formulate hypotheses correctly it is important to understand the con- 
text of the situation and structure the hypotheses to provide the information the researcher or 
decision maker wants. 


Summary of Forms for Null and Alternative Hypotheses 


The hypothesis tests in this chapter involve two population parameters: the population
mean and the population proportion. Depending on the situation, hypothesis tests about
a population parameter may take one of three forms: Two use inequalities in the null hypothesis;
the third uses an equality in the null hypothesis. For hypothesis tests involving a
population mean, we let $\mu_0$ denote the hypothesized value and we must choose one of the
following three forms for the hypothesis test.


$$
\begin{array}{lll}
H_0: \mu \ge \mu_0 & H_0: \mu \le \mu_0 & H_0: \mu = \mu_0 \\
H_a: \mu < \mu_0 & H_a: \mu > \mu_0 & H_a: \mu \ne \mu_0
\end{array}
$$


For reasons that will be clear later, the first two forms are called one-tailed tests. The third
form is called a two-tailed test.

The three possible forms of hypotheses $H_0$ and $H_a$ are shown here. Note that the
equality always appears in the null hypothesis $H_0$.

In many situations, the choice of $H_0$ and $H_a$ is not obvious and judgment is necessary
to select the proper form. However, as the preceding forms show, the equality part of the
expression (either $\ge$, $\le$, or $=$) always appears in the null hypothesis. In selecting the
proper form of $H_0$ and $H_a$, keep in mind that the alternative hypothesis is often what the
test is attempting to establish. Hence, asking whether the user is looking for evidence to
support $\mu < \mu_0$, $\mu > \mu_0$, or $\mu \ne \mu_0$ will help determine $H_a$. The following exercises are
designed to provide practice in choosing the proper form for a hypothesis test involving a
population mean.
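
The rule that the equality always belongs to the null hypothesis can be written as a small lookup. The following is a minimal sketch, not part of the text's method; the function name `hypothesis_forms` and the labels "less", "greater", and "not equal" are hypothetical choices made for illustration.

```python
def hypothesis_forms(mu0, alternative):
    """Return (H0, Ha, test type) for a hypothesis test about a population mean.

    alternative names what the test is trying to establish: "less", "greater",
    or "not equal" (hypothetical labels chosen for this sketch).
    """
    forms = {
        "less": (">=", "<", "one-tailed"),
        "greater": ("<=", ">", "one-tailed"),
        "not equal": ("=", "!=", "two-tailed"),
    }
    null_sign, alt_sign, tail = forms[alternative]
    # The equality part (>=, <=, or =) always goes in the null hypothesis
    return (f"H0: mu {null_sign} {mu0}", f"Ha: mu {alt_sign} {mu0}", tail)


# The three examples from this section
print(hypothesis_forms(24, "greater"))      # new fuel injection system
print(hypothesis_forms(67.6, "less"))       # agency checking for underfilling
print(hypothesis_forms(67.6, "not equal"))  # manufacturer's quality control
```

Starting from the alternative hypothesis mirrors the advice above: decide what the test is attempting to establish, then let the null hypothesis take the complementary form.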
EXERCISES


1. Hotel Guest Bills. The manager of the Danvers-Hilton Resort Hotel stated that the 
mean guest bill for a weekend is $600 or less. A member of the hotel’s accounting 
staff noticed that the total charges for guest bills have been increasing in recent 
months. The accountant will use a sample of future weekend guest bills to test the 
manager’s claim. 

a. Which form of the hypotheses should be used to test the manager’s claim? Explain. 


$$
\begin{array}{lll}
H_0: \mu \ge 600 & H_0: \mu \le 600 & H_0: \mu = 600 \\
H_a: \mu < 600 & H_a: \mu > 600 & H_a: \mu \ne 600
\end{array}
$$


b. What conclusion is appropriate when $H_0$ cannot be rejected?
c. What conclusion is appropriate when $H_0$ can be rejected?

2. Bonus Plan’s Effect on Automobile Sales. The manager of an automobile dealership 
is considering a new bonus plan designed to increase sales volume. Currently, the 
mean sales volume is 14 automobiles per month. The manager wants to conduct a re- 
search study to see whether the new bonus plan increases sales volume. To collect data 
on the plan, a sample of sales personnel will be allowed to sell under the new bonus 
plan for a one-month period. 

a. Develop the null and alternative hypotheses most appropriate for this situation. 
b. Comment on the conclusion when $H_0$ cannot be rejected.
c. Comment on the conclusion when $H_0$ can be rejected.

3. Filling Detergent Cartons. A production line operation is designed to fill cartons with 
laundry detergent to a mean weight of 32 ounces. A sample of cartons is periodically 
selected and weighed to determine whether underfilling or overfilling is occurring. If 
the sample data lead to a conclusion of underfilling or overfilling, the production line 
will be shut down and adjusted to obtain proper filling. 

a. Formulate the null and alternative hypotheses that will help in deciding whether to 
shut down and adjust the production line. 

b. Comment on the conclusion and the decision when $H_0$ cannot be rejected.

c. Comment on the conclusion and the decision when $H_0$ can be rejected.

4. Process Improvement. Because of high production-changeover time and costs, a 
director of manufacturing must convince management that a proposed manufactur- 
ing method reduces costs before the new method can be implemented. The current 
production method operates with a mean cost of $220 per hour. A research study will 
measure the cost of the new method over a sample production period. 

a. Develop the null and alternative hypotheses most appropriate for this study. 
b. Comment on the conclusion when $H_0$ cannot be rejected.
c. Comment on the conclusion when $H_0$ can be rejected.


9.2 Type I and Type II Errors


The null and alternative hypotheses are competing statements about the population. Either
the null hypothesis $H_0$ is true or the alternative hypothesis $H_a$ is true, but not both. Ideally the
hypothesis testing procedure should lead to the acceptance of $H_0$ when $H_0$ is true and the
rejection of $H_0$ when $H_a$ is true. Unfortunately, the correct conclusions are not always possible.
Because hypothesis tests are based on sample information, we must allow for the possibility
of errors. Table 9.1 illustrates the two kinds of errors that can be made in hypothesis testing.

The first row of Table 9.1 shows what can happen if the conclusion is to accept $H_0$. If $H_0$
is true, this conclusion is correct. However, if $H_a$ is true, we make a Type II error; that is,
we accept $H_0$ when it is false. The second row of Table 9.1 shows what can happen if the
conclusion is to reject $H_0$. If $H_0$ is true, we make a Type I error; that is, we reject $H_0$ when
it is true. However, if $H_a$ is true, rejecting $H_0$ is correct.
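
The four outcomes just described can be expressed as a small decision function. This is only an illustrative sketch; the function name `classify_conclusion` and its boolean arguments are hypothetical.

```python
def classify_conclusion(h0_is_true, reject_h0):
    """Classify a hypothesis testing conclusion against the true population condition."""
    if reject_h0:
        # Rejecting a true H0 is a Type I error; rejecting a false H0 is correct
        return "Type I error" if h0_is_true else "Correct conclusion"
    # Accepting a true H0 is correct; accepting a false H0 is a Type II error
    return "Correct conclusion" if h0_is_true else "Type II error"


# Enumerate the four combinations of conclusion and population condition
for reject_h0 in (False, True):
    for h0_is_true in (True, False):
        conclusion = "Reject H0" if reject_h0 else "Accept H0"
        condition = "H0 true" if h0_is_true else "Ha true"
        print(f"{conclusion}, {condition}: {classify_conclusion(h0_is_true, reject_h0)}")
```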

TABLE 9.1 Errors and Correct Conclusions in Hypothesis Testing

                          Population Condition
Conclusion                $H_0$ True              $H_a$ True
Accept $H_0$              Correct Conclusion      Type II Error
Reject $H_0$              Type I Error            Correct Conclusion

If the sample data are consistent with the null hypothesis $H_0$, we will follow the practice of
concluding “do not reject $H_0$.” This conclusion is preferred over “accept $H_0$,” because the
conclusion to accept $H_0$ puts us at risk of making a Type II error.

Recall the hypothesis testing illustration discussed in Section 9.1, in which an automobile
product research group developed a new fuel injection system designed to increase the
miles-per-gallon rating of a particular automobile. With the current model obtaining an
average of 24 miles per gallon, the hypothesis test was formulated as follows.


$H_0: \mu \le 24$
$H_a: \mu > 24$

The alternative hypothesis, $H_a: \mu > 24$, indicates that the researchers are looking for
sample evidence to support the conclusion that the population mean miles per gallon with
the new fuel injection system is greater than 24.

In this application, the Type I error of rejecting $H_0$ when it is true corresponds to the
researchers claiming that the new system improves the miles-per-gallon rating ($\mu > 24$)
when in fact the new system is not any better than the current system. In contrast, the
Type II error of accepting $H_0$ when it is false corresponds to the researchers concluding
that the new system is not any better than the current system ($\mu \le 24$) when in fact the new
system improves miles-per-gallon performance.

For the miles-per-gallon rating hypothesis test, the null hypothesis is $H_0: \mu \le 24$.
Suppose the null hypothesis is true as an equality; that is, $\mu = 24$. The probability of
making a Type I error when the null hypothesis is true as an equality is called the level of
significance. Thus, for the miles-per-gallon rating hypothesis test, the level of significance
is the probability of rejecting $H_0: \mu \le 24$ when $\mu = 24$. Because of the importance of this
concept, we now restate the definition of level of significance.


LEVEL OF SIGNIFICANCE 


The level of significance is the probability of making a Type I error when the null 
hypothesis is true as an equality. 


The Greek symbol $\alpha$ (alpha) is used to denote the level of significance, and common
choices for $\alpha$ are .05 and .01.
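
The definition can be illustrated by simulation. The sketch below is a hedged illustration, not a procedure from the text: it assumes a normally distributed population with $\mu = 24$, a hypothetical known standard deviation of 2 miles per gallon, samples of 36 automobiles, and the standard upper tail z test that rejects $H_0$ when the test statistic exceeds the critical value for $\alpha$. The function name `estimate_type_i_rate` is also hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def estimate_type_i_rate(mu0=24.0, sigma=2.0, n=36, alpha=0.05, trials=10_000, seed=1):
    """Estimate how often H0: mu <= mu0 is rejected when mu equals mu0 exactly."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # upper tail critical value
    std_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 is true as an equality
        sample_mean = fmean(rng.gauss(mu0, sigma) for _ in range(n))
        z = (sample_mean - mu0) / std_error
        if z > z_crit:
            rejections += 1  # a Type I error: H0 rejected although it is true
    return rejections / trials


rate = estimate_type_i_rate()
print(f"Estimated probability of a Type I error: {rate:.3f}")
```

The estimated proportion of rejections should be close to the chosen $\alpha$, which is what it means for the person conducting the test to control the probability of a Type I error.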

In practice, the person responsible for the hypothesis test specifies the level of significance.
By selecting $\alpha$, that person is controlling the probability of making a Type I error.
If the cost of making a Type I error is high, small values of $\alpha$ are preferred. If the cost of
making a Type I error is not too high, larger values of $\alpha$ are typically used. Applications of
hypothesis testing that only control for the Type I error are called significance tests. Many
applications of hypothesis testing are of this type.

Although most applications of hypothesis testing control for the probability of making
a Type I error, they do not always control for the probability of making a Type II error.
Hence, if we decide to accept $H_0$, we cannot determine how confident we can be with
that decision. Because of the uncertainty associated with making a Type II error when
conducting significance tests, statisticians usually recommend that we use the statement
“do not reject $H_0$” instead of “accept $H_0$.” Using the statement “do not reject $H_0$” carries
the recommendation to withhold both judgment and action. In effect, by not directly
accepting $H_0$, the statistician avoids the risk of making a Type II error. Whenever the probability
of making a Type II error has not been determined and controlled, we will not make
the statement “accept $H_0$.” In such cases, only two conclusions are possible: do not reject
$H_0$ or reject $H_0$.

Although controlling for a Type II error in hypothesis testing is not common, it can be
done. More advanced texts describe procedures for determining and controlling the probability
of making a Type II error.¹ If proper controls have been established for this error,
action based on the “accept $H_0$” conclusion can be appropriate.


NOTE + COMMENT 


Walter Williams, syndicated columnist and professor of economics
at George Mason University, points out that the possibility
of making a Type I or a Type II error is always present in
decision making (The Cincinnati Enquirer). He notes that the
Food and Drug Administration runs the risk of making these
errors in its drug approval process. The FDA must either
approve a new drug or not approve it. Thus, the FDA runs
the risk of making a Type I error by approving a new drug that
is not safe and effective, or making a Type II error by failing
to approve a new drug that is safe and effective. Regardless
of the decision made, the possibility of making a costly error
cannot be eliminated.


EXERCISES 


5. Beer and Cider Consumption. According to the National Beer Wholesalers Associ- 
ation, U.S. consumers 21 years and older consumed 26.9 gallons of beer and cider per 
person during 2017. A distributor in Milwaukee believes that beer and cider consump- 
tion are higher in that city. A sample of consumers 21 years and older in Milwaukee 
will be taken, and the sample mean 2017 beer and cider consumption will be used to 
test the following null and alternative hypotheses: 


$H_0: \mu \le 26.9$
$H_a: \mu > 26.9$


a. Assume the sample data led to rejection of the null hypothesis. What would be your 
conclusion about consumption of beer and cider in Milwaukee? 

b. What is the Type I error in this situation? What are the consequences of making 
this error? 

c. What is the Type II error in this situation? What are the consequences of making 
this error? 

6. Orange Juice Labels. The label on a 3-quart container of orange juice states that the
orange juice contains an average of 1 gram of fat or less. Answer the following questions
for a hypothesis test that could be used to test the claim on the label.

a. Develop the appropriate null and alternative hypotheses. 

b. What is the Type I error in this situation? What are the consequences of making 
this error? 

c. What is the Type II error in this situation? What are the consequences of making
this error? 

7. Carpet Salesperson Salaries. Carpetland salespersons average $8000 per week in
sales. Steve Contois, the firm’s vice president, proposes a compensation plan with new
selling incentives. Steve hopes that the results of a trial selling period will enable him
to conclude that the compensation plan increases the average sales per salesperson.

a. Develop the appropriate null and alternative hypotheses.

b. What is the Type I error in this situation? What are the consequences of making
this error?

c. What is the Type II error in this situation? What are the consequences of making
this error?

8. Production Operating Costs. Suppose a new production method will be implemented
if a hypothesis test supports the conclusion that the new method reduces the mean
operating cost per hour.

a. State the appropriate null and alternative hypotheses if the mean cost for the current
production method is $220 per hour.

b. What is the Type I error in this situation? What are the consequences of making
this error?

c. What is the Type II error in this situation? What are the consequences of making
this error?


¹See, for example, D. R. Anderson, D. J. Sweeney, and T. A. Williams, Statistics for Business and Economics,
13th edition (Cincinnati: Cengage Learning, 2018).


9.3 Population Mean: $\sigma$ Known


In Chapter 8, we said that the $\sigma$ known case corresponds to applications in which historical
data and/or other information are available that enable us to obtain a good estimate of
the population standard deviation prior to sampling. In such cases the population standard
deviation can, for all practical purposes, be considered known. In this section we show how
to conduct a hypothesis test about a population mean for the $\sigma$ known case.

The methods presented in this section are exact if the sample is selected from a popula- 
tion that is normally distributed. In cases where it is not reasonable to assume the popu- 
lation is normally distributed, these methods are still applicable if the sample size is large 
enough. We provide some practical advice concerning the population distribution and the 
sample size at the end of this section. 


One-Tailed Test 


One-tailed tests about a population mean take one of the following two forms. 


Lower Tail Test              Upper Tail Test
$H_0: \mu \geq \mu_0$        $H_0: \mu \leq \mu_0$
$H_a: \mu < \mu_0$           $H_a: \mu > \mu_0$


Let us consider an example involving a lower tail test. 

The Federal Trade Commission (FTC) periodically conducts statistical studies designed to 
test the claims that manufacturers make about their products. For example, the label on a large 
can of Hilltop Coffee states that the can contains 3 pounds of coffee. The FTC knows that 
Hilltop’s production process cannot place exactly 3 pounds of coffee in each can, even if the 
mean filling weight for the population of all cans filled is 3 pounds per can. However, as long 
as the population mean filling weight is at least 3 pounds per can, the rights of consumers 
will be protected. Thus, the FTC interprets the label information on a large can of coffee as a 
claim by Hilltop that the population mean filling weight is at least 3 pounds per can. We will 
show how the FTC can check Hilltop’s claim by conducting a lower tail hypothesis test. 

The first step is to develop the null and alternative hypotheses for the test. If the 
population mean filling weight is at least 3 pounds per can, Hilltop’s claim is correct. 
This establishes the null hypothesis for the test. However, if the population mean weight 
is less than 3 pounds per can, Hilltop’s claim is incorrect. This establishes the alternative 
hypothesis. With $\mu$ denoting the population mean filling weight, the null and alternative
hypotheses are as follows:


$$\begin{aligned} H_0&: \mu \geq 3 \\ H_a&: \mu < 3 \end{aligned}$$


Note that the hypothesized value of the population mean is $\mu_0 = 3$.


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 



If the sample data indicate that $H_0$ cannot be rejected, the statistical evidence does not
support the conclusion that a label violation has occurred. Hence, no action should be
taken against Hilltop. However, if the sample data indicate that $H_0$ can be rejected, we will
conclude that the alternative hypothesis, $H_a: \mu < 3$, is true. In this case a conclusion of
underfilling and a charge of a label violation against Hilltop would be justified.

Suppose a sample of 36 cans of coffee is selected and the sample mean $\bar{x}$ is computed as
an estimate of the population mean $\mu$. If the value of the sample mean $\bar{x}$ is less than
3 pounds, the sample results will cast doubt on the null hypothesis. What we want to know is how
much less than 3 pounds $\bar{x}$ must be before we would be willing to declare the difference
significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key
factor in addressing this issue is the value the decision maker selects for the level of significance.

As noted in the preceding section, the level of significance, denoted by $\alpha$, is the
probability of making a Type I error by rejecting $H_0$ when the null hypothesis is true as an
equality. The decision maker must specify the level of significance. If the cost of making a
Type I error is high, a small value should be chosen for the level of significance. If the cost
is not high, a larger value is more appropriate. In the Hilltop Coffee study, the director of
the FTC’s testing program made the following statement: “If the company is meeting its
weight specifications at $\mu = 3$, I do not want to take action against them. But, I am willing
to risk a 1% chance of making such an error.” From the director’s statement, we set the
level of significance for the hypothesis test at $\alpha = .01$. Thus, we must design the
hypothesis test so that the probability of making a Type I error when $\mu = 3$ is .01.

For the Hilltop Coffee study, by developing the null and alternative hypotheses and 
specifying the level of significance for the test, we carry out the first two steps required in 
conducting every hypothesis test. We are now ready to perform the third step of hypothesis 
testing: collect the sample data and compute the value of what is called a test statistic. 


Test Statistic For the Hilltop Coffee study, previous FTC tests show that the population
standard deviation can be assumed known with a value of $\sigma = .18$. In addition, these
tests also show that the population of filling weights can be assumed to have a normal
distribution. From the study of sampling distributions in Chapter 7 we know that if the
population from which we are sampling is normally distributed, the sampling distribution
of $\bar{x}$ will also be normally distributed. Thus, for the Hilltop Coffee study, the sampling
distribution of $\bar{x}$ is normally distributed. With a known value of $\sigma = .18$ and a
sample size of $n = 36$, Figure 9.1 shows the sampling distribution of $\bar{x}$ when the null
hypothesis is true as an equality, that is, when $\mu = \mu_0 = 3$.² Note that the standard error
of $\bar{x}$ is given by $\sigma_{\bar{x}} = \sigma/\sqrt{n} = .18/\sqrt{36} = .03$.

The standard error of $\bar{x}$ is the standard deviation of the sampling distribution of $\bar{x}$.


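FIGURE 9.1 Sampling Distribution of $\bar{x}$ for the Hilltop Coffee Study When the Null Hypothesis Is True as an Equality ($\mu = 3$)

[Figure: sampling distribution of $\bar{x}$ centered at $\mu = 3$, with $\sigma_{\bar{x}} = \sigma/\sqrt{n} = .18/\sqrt{36} = .03$]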


²In constructing sampling distributions for hypothesis tests, it is assumed that $H_0$ is satisfied as an equality.
A small p-value indicates that the value of the test statistic is unusual given the assumption
that $H_0$ is true.

DATA file: Coffee
Because the sampling distribution of $\bar{x}$ is normally distributed, the sampling distribution of

$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\bar{x} - 3}{.03}$$

is a standard normal distribution. A value of $z = -1$ means that the value of $\bar{x}$ is one
standard error below the hypothesized value of the mean, a value of $z = -2$ means that
the value of $\bar{x}$ is two standard errors below the hypothesized value of the mean, and so on.
We can use the standard normal probability table to find the lower tail probability
corresponding to any z value. For instance, the lower tail area at $z = -3.00$ is .0013. Hence, the
probability of obtaining a value of z that is three or more standard errors below the mean is
.0013. As a result, the probability of obtaining a value of $\bar{x}$ that is 3 or more standard
errors below the hypothesized population mean $\mu_0 = 3$ is also .0013. Such a result is unlikely if
the null hypothesis is true.

For hypothesis tests about a population mean in the $\sigma$ known case, we use the standard
normal random variable z as a test statistic to determine whether $\bar{x}$ deviates from the
hypothesized value of $\mu$ enough to justify rejecting the null hypothesis. With
$\sigma_{\bar{x}} = \sigma/\sqrt{n}$, the test statistic is as follows.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION MEAN: $\sigma$ KNOWN

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \tag{9.1}$$


The key question for a lower tail test is, How small must the test statistic z be before we 
choose to reject the null hypothesis? Two approaches can be used to answer this question: 
the p-value approach and the critical value approach. 


p-Value Approach The p-value approach uses the value of the test statistic z to compute a 
probability called a p-value. 


p-VALUE 


A p-value is a probability that provides a measure of the evidence against the null
hypothesis provided by the sample. Smaller p-values indicate more evidence against $H_0$.


The p-value is used to determine whether the null hypothesis should be rejected. 

Let us see how the p-value is computed and used. The value of the test statistic is used
to compute the p-value. The method used depends on whether the test is a lower tail, an
upper tail, or a two-tailed test. For a lower tail test, the p-value is the probability of obtaining
a value for the test statistic as small as or smaller than that provided by the sample. Thus,
to compute the p-value for the lower tail test in the $\sigma$ known case, we must find, using the
standard normal distribution, the probability that z is less than or equal to the value of the
test statistic. After computing the p-value, we must then decide whether it is small enough
to reject the null hypothesis; as we will show, this decision involves comparing the p-value
to the level of significance.

Let us now compute the p-value for the Hilltop Coffee lower tail test. Suppose the
sample of 36 Hilltop coffee cans provides a sample mean of $\bar{x} = 2.92$ pounds. Is
$\bar{x} = 2.92$ small enough to cause us to reject $H_0$? Because this is a lower tail test, the
p-value is the area under the standard normal curve for values of z less than or equal to the
value of the test statistic. Using $\bar{x} = 2.92$, $\sigma = .18$, and $n = 36$, we compute the
value of the test statistic z.

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{2.92 - 3}{.18/\sqrt{36}} = -2.67$$
FIGURE 9.2 p-Value for the Hilltop Coffee Study When $\bar{x} = 2.92$ and $z = -2.67$

[Figure: sampling distribution of $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n})$ with the lower tail area at $z = -2.67$ shaded]


Thus, the p-value is the probability that z is less than or equal to $-2.67$ (the lower tail area
corresponding to the value of the test statistic).

Using the standard normal probability table, we find that the lower tail area at $z = -2.67$
is .0038. Figure 9.2 shows that $\bar{x} = 2.92$ corresponds to $z = -2.67$ and a p-value = .0038.
This p-value indicates a small probability of obtaining a sample mean of $\bar{x} = 2.92$ (and
a test statistic of $-2.67$) or smaller when sampling from a population with $\mu = 3$. This
p-value does not provide much support for the null hypothesis, but is it small enough to
cause us to reject $H_0$? The answer depends upon the level of significance for the test.

As noted previously, the director of the FTC’s testing program selected a value of .01
for the level of significance. The selection of $\alpha = .01$ means that the director is willing to
tolerate a probability of .01 of rejecting the null hypothesis when it is true as an equality
($\mu_0 = 3$). The sample of 36 coffee cans in the Hilltop Coffee study resulted in a p-value =
.0038, which means that the probability of obtaining a value of $\bar{x} = 2.92$ or less when
the null hypothesis is true as an equality is .0038. Because .0038 is less than or equal to
$\alpha = .01$, we reject $H_0$. Therefore, we find sufficient statistical evidence to reject the null
hypothesis at the .01 level of significance.
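
The Hilltop Coffee calculation can also be checked numerically. The following is a minimal Python sketch rather than part of the text's table-based or Excel procedure. It assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`, and the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Hilltop Coffee lower tail test (sample results taken from the text)
x_bar = 2.92   # sample mean, in pounds
mu_0 = 3.0     # hypothesized population mean
sigma = 0.18   # known population standard deviation
n = 36         # sample size
alpha = 0.01   # level of significance chosen by the FTC director

std_error = sigma / sqrt(n)        # standard error of the sample mean
z = (x_bar - mu_0) / std_error     # test statistic from equation (9.1)
p_value = NormalDist().cdf(z)      # lower tail area under the standard normal curve

reject_h0 = p_value <= alpha       # p-value rejection rule
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Because the sketch uses the unrounded test statistic, its p-value can differ in the fourth decimal place from a value read from the standard normal table.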

We can now state the general rule for determining whether the null hypothesis can be
rejected when using the p-value approach. For a level of significance $\alpha$, the rejection rule
using the p-value approach is as follows.


REJECTION RULE USING p-VALUE


Reject $H_0$ if p-value $\leq \alpha$
FIGURE 9.3 Critical Value = $-2.33$ for the Hilltop Coffee Hypothesis Test

[Figure: sampling distribution of $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{n})$ with lower tail area $\alpha = .01$ to the left of $z = -2.33$]


In the Hilltop Coffee test, the p-value of .0038 resulted in the rejection of the null
hypothesis. Although the basis for making the rejection decision involves a comparison of
the p-value to the level of significance specified by the FTC director, the observed p-value
of .0038 means that we would reject $H_0$ for any value of $\alpha \geq .0038$. For this reason, the
p-value is also called the observed level of significance.

Different decision makers may express different opinions concerning the cost of making a
Type I error and may choose a different level of significance. By providing the p-value as part of
the hypothesis testing results, another decision maker can compare the reported p-value to his or
her own level of significance and possibly make a different decision with respect to rejecting $H_0$.


Critical Value Approach The critical value approach requires that we first determine a
value for the test statistic called the critical value. For a lower tail test, the critical value
serves as a benchmark for determining whether the value of the test statistic is small
enough to reject the null hypothesis. It is the value of the test statistic that corresponds to
an area of $\alpha$ (the level of significance) in the lower tail of the sampling distribution of the
test statistic. In other words, the critical value is the largest value of the test statistic that
will result in the rejection of the null hypothesis. Let us return to the Hilltop Coffee example
and see how this approach works.

In the $\sigma$ known case, the sampling distribution for the test statistic z is a standard normal
distribution. Therefore, the critical value is the value of the test statistic that corresponds
to an area of $\alpha = .01$ in the lower tail of a standard normal distribution. Using the standard
normal probability table, we find that $z = -2.33$ provides an area of .01 in the lower tail
(see Figure 9.3). Thus, if the sample results in a value of the test statistic that is less than
or equal to $-2.33$, the corresponding p-value will be less than or equal to .01; in this case,
we should reject the null hypothesis. Hence, for the Hilltop Coffee study the critical value
rejection rule for a level of significance of .01 is


Reject $H_0$ if $z \leq -2.33$


In the Hilltop Coffee example, $\bar{x} = 2.92$ and the test statistic is $z = -2.67$. Because
$z = -2.67 < -2.33$, we can reject $H_0$ and conclude that Hilltop Coffee is underfilling cans.

We can generalize the rejection rule for the critical value approach to handle any level
of significance. The rejection rule for a lower tail test follows.


REJECTION RULE FOR A LOWER TAIL TEST: CRITICAL VALUE APPROACH

Reject $H_0$ if $z \leq -z_\alpha$


where $-z_\alpha$ is the critical value; that is, the z value that provides an area of $\alpha$ in the
lower tail of the standard normal distribution.
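
As a rough illustration of the lower tail rejection rule, the critical value can be obtained from the inverse of the standard normal cumulative distribution instead of the probability table. This is a sketch that assumes Python 3.8 or later (`statistics.NormalDist.inv_cdf`); the variable names are hypothetical and the inputs are the Hilltop Coffee values from the text.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.01                                   # level of significance

# -z_alpha: the z value with area alpha in the lower tail (about -2.33)
critical_value = NormalDist().inv_cdf(alpha)

# Hilltop Coffee test statistic from the sample results in the text
z = (2.92 - 3.0) / (0.18 / sqrt(36))           # about -2.67

reject_h0 = z <= critical_value                # lower tail rejection rule
print(f"critical value = {critical_value:.2f}, z = {z:.2f}, reject H0: {reject_h0}")
```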
Summary The p-value approach to hypothesis testing and the critical value approach will
always lead to the same rejection decision; that is, whenever the p-value is less than or
equal to $\alpha$, the value of the test statistic will be less than or equal to the critical value. The
advantage of the p-value approach is that the p-value tells us how significant the results are
(the observed level of significance). If we use the critical value approach, we only know
that the results are significant at the stated level of significance.

At the beginning of this section, we said that one-tailed tests about a population mean 
take one of the following two forms: 


Lower Tail Test              Upper Tail Test
$H_0: \mu \geq \mu_0$        $H_0: \mu \leq \mu_0$
$H_a: \mu < \mu_0$           $H_a: \mu > \mu_0$


We used the Hilltop Coffee study to illustrate how to conduct a lower tail test. We can use
the same general approach to conduct an upper tail test. The test statistic z is still computed
using equation (9.1). But, for an upper tail test, the p-value is the probability of obtaining
a value for the test statistic as large as or larger than that provided by the sample. Thus, to
compute the p-value for the upper tail test in the $\sigma$ known case, we must use the standard
normal distribution to compute the probability that z is greater than or equal to the value of
the test statistic. Using the critical value approach causes us to reject the null hypothesis if
the value of the test statistic is greater than or equal to the critical value $z_\alpha$; in other words,
we reject $H_0$ if $z \geq z_\alpha$.

Let us summarize the steps involved in computing p-values for one-tailed hypothesis tests. 


COMPUTATION OF p-VALUES FOR ONE-TAILED TESTS 


1. Compute the value of the test statistic using equation (9.1). 

2. Lower tail test: Using the standard normal distribution, compute the probability 
that z is less than or equal to the value of the test statistic (area in the lower tail). 

3. Upper tail test: Using the standard normal distribution, compute the proba- 
bility that z is greater than or equal to the value of the test statistic (area in the 
upper tail). 


Two-Tailed Test 


In hypothesis testing, the general form for a two-tailed test about a population mean is 
as follows: 


$$\begin{aligned} H_0&: \mu = \mu_0 \\ H_a&: \mu \neq \mu_0 \end{aligned}$$

In this subsection we show how to conduct a two-tailed test about a population mean for
the $\sigma$ known case. As an illustration, we consider the hypothesis testing situation facing
MaxFlight, Inc.

The U.S. Golf Association (USGA) establishes rules that manufacturers of golf equip- 
ment must meet if their products are to be acceptable for use in USGA events. MaxFlight 
uses a high-technology manufacturing process to produce golf balls with a mean driving 
distance of 295 yards. Sometimes, however, the process gets out of adjustment and pro- 
duces golf balls with a mean driving distance different from 295 yards. When the mean dis- 
tance falls below 295 yards, the company worries about losing sales because the golf balls 
do not provide as much distance as advertised. When the mean distance passes 295 yards, 
MaxFlight’s golf balls may be rejected by the USGA for exceeding the overall distance 
standard concerning carry and roll. 

MaxFlight’s quality control program involves taking periodic samples of 50 golf balls 
to monitor the manufacturing process. For each sample, a hypothesis test is conducted to 
determine whether the process has fallen out of adjustment. Let us develop the null and
alternative hypotheses. We begin by assuming that the process is functioning correctly;
that is, the golf balls being produced have a mean distance of 295 yards. This assumption
establishes the null hypothesis. The alternative hypothesis is that the mean distance is
not equal to 295 yards. With a hypothesized value of $\mu_0 = 295$, the null and alternative
hypotheses for the MaxFlight hypothesis test are as follows:


$$\begin{aligned} H_0&: \mu = 295 \\ H_a&: \mu \neq 295 \end{aligned}$$


If the sample mean $\bar{x}$ is significantly less than 295 yards or significantly greater than
295 yards, we will reject $H_0$. In this case, corrective action will be taken to adjust the
manufacturing process. On the other hand, if $\bar{x}$ does not deviate from the hypothesized
mean $\mu_0 = 295$ by a significant amount, $H_0$ will not be rejected and no action will be taken
to adjust the manufacturing process.

The quality control team selected $\alpha = .05$ as the level of significance for the test. Data
from previous tests conducted when the process was known to be in adjustment show that
the population standard deviation can be assumed known with a value of $\sigma = 12$. Thus,
with a sample size of $n = 50$, the standard error of $\bar{x}$ is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{50}} = 1.7$$

Because the sample size is large, the central limit theorem allows us to conclude that the
sampling distribution of $\bar{x}$ can be approximated by a normal distribution. Figure 9.4 shows
the sampling distribution of $\bar{x}$ for the MaxFlight hypothesis test with a hypothesized
population mean of $\mu_0 = 295$.

The Central Limit Theorem is discussed in Chapter 7.

DATA file: GolfTest

Suppose that a sample of 50 golf balls is selected and that the sample mean is
$\bar{x} = 297.6$ yards. This sample mean provides support for the conclusion that the
population mean is larger than 295 yards. Is this value of $\bar{x}$ enough larger than 295 to cause
us to reject $H_0$ at the .05 level of significance? In the previous section we described
two approaches that can be used to answer this question: the p-value approach and the
critical value approach.

p-Value Approach Recall that the p-value is a probability used to determine whether
the null hypothesis should be rejected. For a two-tailed test, values of the test statistic in
either tail provide evidence against the null hypothesis. For a two-tailed test, the p-value is
the probability of obtaining a value for the test statistic as unlikely as or more unlikely than
that provided by the sample. Let us see how the p-value is computed for the MaxFlight
hypothesis test.


FIGURE 9.4 Sampling Distribution of $\bar{x}$ for the MaxFlight Hypothesis Test

[Figure: sampling distribution of $\bar{x}$]
FIGURE 9.5 p-Value for the MaxFlight Hypothesis Test

[Figure: standard normal curve centered at 0 with lower tail area $P(z \leq -1.53) = .0630$ and upper tail area $P(z \geq 1.53) = .0630$; p-value = 2(.0630) = .1260]


First we compute the value of the test statistic. For the $\sigma$ known case, the test statistic z
is a standard normal random variable. Using equation (9.1) with $\bar{x} = 297.6$, the value of the
test statistic is

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{297.6 - 295}{12/\sqrt{50}} = 1.53$$

Now to compute the p-value we must find the probability of obtaining a value for the test
statistic at least as unlikely as $z = 1.53$. Clearly values of $z \geq 1.53$ are at least as unlikely.
But, because this is a two-tailed test, values of $z \leq -1.53$ are also at least as unlikely as
the value of the test statistic provided by the sample. In Figure 9.5, we see that the
two-tailed p-value in this case is given by $P(z \leq -1.53) + P(z \geq 1.53)$. Because the normal
curve is symmetric, we can compute this probability by finding $P(z \geq 1.53)$ and doubling
it. The table for the standard normal distribution shows that $P(z < 1.53) = .9370$. Thus, the
upper tail area is $P(z \geq 1.53) = 1.0000 - .9370 = .0630$. Doubling this, we find that the
p-value for the MaxFlight two-tailed hypothesis test is p-value = 2(.0630) = .1260.

Next we compare the p-value to the level of significance to see whether the null
hypothesis should be rejected. With a level of significance of $\alpha = .05$, we do not reject $H_0$
because the p-value = .1260 > .05. Because the null hypothesis is not rejected, no action
will be taken to adjust the MaxFlight manufacturing process.

Let us summarize the steps involved in computing p-values for two-tailed hypothesis tests.


COMPUTATION OF p-VALUES FOR TWO-TAILED TESTS 


1. Compute the value of the test statistic using equation (9.1). 

2. If the value of the test statistic is in the upper tail, compute the probability that 
z is greater than or equal to the value of the test statistic (the upper tail area). If 
the value of the test statistic is in the lower tail, compute the probability that z is 
less than or equal to the value of the test statistic (the lower tail area). 

3. Double the probability (or tail area) from step 2 to obtain the p-value. 
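
The three steps above can be sketched in code. The function below is a hypothetical helper written for illustration, not a routine from the text; it assumes Python 3.8 or later for `statistics.NormalDist` and is applied here to the MaxFlight sample results.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p_value(x_bar, mu_0, sigma, n):
    """Return (z, p-value) for a two-tailed test about a mean with sigma known."""
    # Step 1: compute the test statistic using equation (9.1)
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    # Step 2: find the tail area on the side where the test statistic falls
    if z >= 0:
        tail_area = 1 - std_normal.cdf(z)   # upper tail area
    else:
        tail_area = std_normal.cdf(z)       # lower tail area
    # Step 3: double the tail area
    return z, 2 * tail_area

z, p_value = two_tailed_p_value(x_bar=297.6, mu_0=295, sigma=12, n=50)
print(f"z = {z:.4f}, two-tailed p-value = {p_value:.4f}")
```

With the unrounded test statistic the p-value comes out near .1255, slightly below the .1260 obtained from the table with z rounded to 1.53. The conclusion at the .05 level is the same either way.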


Critical Value Approach Before leaving this section, let us see how the test statistic z can
be compared to a critical value to make the hypothesis testing decision for a two-tailed test.
Figure 9.6 shows that the critical values for the test will occur in both the lower and upper
tails of the standard normal distribution. With a level of significance of $\alpha = .05$, the area
in each tail corresponding to the critical values is $\alpha/2 = .05/2 = .025$. Using the standard
normal probability table, we find the critical values for the test statistic are $-z_{.025} = -1.96$
and $z_{.025} = 1.96$. Thus, using the critical value approach, the two-tailed rejection rule is


Reject $H_0$ if $z \leq -1.96$ or if $z \geq 1.96$


FIGURE 9.6 Critical Values for the MaxFlight Hypothesis Test

[Figure: standard normal curve centered at 0 with Area = .025 in each tail; reject $H_0$ for $z \leq -1.96$ or $z \geq 1.96$]


Because the value of the test statistic for the MaxFlight study is $z = 1.53$, the statistical
evidence will not permit us to reject the null hypothesis at the .05 level of significance.
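
The two-tailed rejection rule can be checked in the same way. The lines below are a small sketch, assuming Python 3.8 or later for `statistics.NormalDist.inv_cdf`; the variable names are illustrative, and the test statistic is the rounded MaxFlight value from the text.

```python
from statistics import NormalDist

alpha = 0.05                                     # level of significance

# Upper critical value z_{alpha/2}: area alpha/2 lies in the upper tail (about 1.96)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

z = 1.53                                         # MaxFlight test statistic from the text

# Reject H0 only if z falls in either tail beyond the critical values
reject_h0 = (z <= -z_crit) or (z >= z_crit)
print(f"critical values = +/-{z_crit:.2f}, reject H0: {reject_h0}")
```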


Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population
mean for the $\sigma$ known case using the p-value approach. Recall that the method used to
compute a p-value depends upon whether the test is lower tail, upper tail, or two-tailed.
Therefore, in the Excel procedure we describe we will use the sample results to compute
three p-values: p-value (Lower Tail), p-value (Upper Tail), and p-value (Two Tail). The user
can then choose $\alpha$ and draw a conclusion using whichever p-value is appropriate for the
type of hypothesis test being conducted. We will illustrate using the MaxFlight two-tailed
hypothesis test. Refer to Figure 9.7 as we describe the tasks involved. The formula
worksheet is in the background; the value worksheet is in the foreground.


Enter/Access Data: Open the file GolfTest. A label and the distance data for the sample of 
50 golf balls are entered in cells A1:A51. 


Enter Functions and Formulas: The sample size and sample mean are computed in
cells D4 and D5 using Excel’s COUNT and AVERAGE functions, respectively. The value
worksheet shows that the sample size is 50 and the sample mean is 297.6. The value of the
known population standard deviation (12) is entered in cell D7, and the hypothesized value
of the population mean (295) is entered in cell D8.


The standard error is obtained in cell D10 by entering the formula =D7/SQRT(D4). 
The formula =(D5-D8)/D10 in cell D11 computes the test statistic z (1.5321). To compute 
the p-value for a lower tail test, we enter the formula =NORM.S.DIST(D11,TRUE) in cell 
D13. The p-value for an upper tail test is then computed in cell D14 as 1 minus the p-value 
for the lower tail test. Finally, the p-value for a two-tailed test is computed in cell D15 as two 
times the minimum of the two one-tailed p-values. The value worksheet shows that p-value 
(Lower Tail) = 0.9372, p-value (Upper Tail) = 0.0628, and p-value (Two Tail) = 0.1255. 

The development of the worksheet is now complete. For the two-tailed MaxFlight problem
we cannot reject $H_0: \mu = 295$ using $\alpha = .05$ because the p-value (Two Tail) = 0.1255
is greater than $\alpha$. Thus, the quality control manager has no reason to doubt that the
manufacturing process is producing golf balls with a population mean distance of 295 yards.


FIGURE 9.7 Excel Worksheet: Hypothesis Test for the $\sigma$ Known Case

Column A holds the label Yards in cell A1 and the 50 sample distances in cells A2:A51.

Cell   Label                           Formula worksheet         Value worksheet
D4     Sample Size                     =COUNT(A2:A51)            50
D5     Sample Mean                     =AVERAGE(A2:A51)          297.6
D7     Population Standard Deviation   12                        12
D8     Hypothesized Value              295                       295
D10    Standard Error                  =D7/SQRT(D4)              1.6971
D11    Test Statistic z                =(D5-D8)/D10              1.5321
D13    p-value (Lower Tail)            =NORM.S.DIST(D11,TRUE)    0.9372
D14    p-value (Upper Tail)            =1-D13                    0.0628
D15    p-value (Two Tail)              =2*MIN(D13,D14)           0.1255

Note: Rows 17-49 are hidden.

The file GolfTest includes a worksheet entitled Template that uses the A:A method for
entering the data ranges.


A Template for Other Problems The worksheet in Figure 9.7 can be used as a template
for conducting any one-tailed and two-tailed hypothesis tests for the $\sigma$ known case. Just
enter the appropriate data in column A, adjust the ranges for the formulas in cells D4 and
D5, enter the population standard deviation in cell D7, and enter the hypothesized value
in cell D8. The standard error, the test statistic, and the three p-values will then appear.
Depending on the form of the hypothesis test (lower tail, upper tail, or two-tailed), we can
then choose the appropriate p-value to make the rejection decision.

We can further simplify the use of Figure 9.7 as a template for other problems by
eliminating the need to enter new data ranges in cells D4 and D5. To do so we rewrite the cell
formulas as follows:


Cell D4: =COUNT(A:A)
Cell D5: =AVERAGE(A:A)

With the A:A method of specifying data ranges, Excel’s COUNT function will count the 
number of numerical values in column A and Excel’s AVERAGE function will compute 
the average of the numerical values in column A. Thus, to solve a new problem it is only 
necessary to enter the new data in column A, enter the value of the known population 
standard deviation in cell D7, and enter the hypothesized value of the population mean in 
cell D8. 

The worksheet can also be used as a template for text exercises in which $n$, $\bar{x}$, and
$\sigma$ are given. Just ignore the data in column A and enter the values for $n$, $\bar{x}$, and
$\sigma$ in cells D4, D5, and D7, respectively. Then enter the appropriate hypothesized value for
the population mean in cell D8. The p-values corresponding to lower tail, upper tail, and two-tailed
hypothesis tests will then appear in cells D13:D15.
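
For readers working outside Excel, the template's calculations can be mirrored in a few lines of code. The function below is an illustrative sketch, not part of the GolfTest file; its name and dictionary keys are hypothetical, and it assumes Python 3.8 or later for `statistics.NormalDist`. The comments map each line to the worksheet cell described in the text.

```python
from math import sqrt
from statistics import NormalDist

def z_test_summary(n, x_bar, sigma, mu_0):
    """Mirror worksheet cells D10:D15 for the sigma known case."""
    std_error = sigma / sqrt(n)              # D10: =D7/SQRT(D4)
    z = (x_bar - mu_0) / std_error           # D11: =(D5-D8)/D10
    p_lower = NormalDist().cdf(z)            # D13: =NORM.S.DIST(D11,TRUE)
    p_upper = 1 - p_lower                    # D14: =1-D13
    p_two = 2 * min(p_lower, p_upper)        # D15: twice the smaller one-tailed p-value
    return {
        "standard_error": std_error,
        "z": z,
        "p_lower": p_lower,
        "p_upper": p_upper,
        "p_two": p_two,
    }

# MaxFlight inputs: n (D4), sample mean (D5), sigma (D7), hypothesized mean (D8)
summary = z_test_summary(n=50, x_bar=297.6, sigma=12, mu_0=295)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

As with the worksheet, the caller picks whichever of the three p-values matches the form of the test and compares it with the chosen level of significance.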


Summary and Practical Advice 


We presented examples of a lower tail test and a two-tailed test about a population mean.
Based upon these examples, we can now summarize the hypothesis testing procedures
about a population mean for the $\sigma$ known case as shown in Table 9.2. Note that $\mu_0$ is the
hypothesized value of the population mean.
TABLE 9.2 Summary of Hypothesis Tests About a Population Mean: $\sigma$ Known Case

                          Lower Tail Test            Upper Tail Test            Two-Tailed Test
Hypotheses                $H_0: \mu \geq \mu_0$      $H_0: \mu \leq \mu_0$      $H_0: \mu = \mu_0$
                          $H_a: \mu < \mu_0$         $H_a: \mu > \mu_0$         $H_a: \mu \neq \mu_0$
Test Statistic            $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (the same for all three forms)
Rejection Rule:           Reject $H_0$ if            Reject $H_0$ if            Reject $H_0$ if
p-Value Approach          p-value $\leq \alpha$      p-value $\leq \alpha$      p-value $\leq \alpha$
Rejection Rule:           Reject $H_0$ if            Reject $H_0$ if            Reject $H_0$ if
Critical Value Approach   $z \leq -z_\alpha$         $z \geq z_\alpha$          $z \leq -z_{\alpha/2}$ or if $z \geq z_{\alpha/2}$

The hypothesis testing steps followed in the two examples presented in this section are 
common to every hypothesis test. 


STEPS OF HYPOTHESIS TESTING 


Step 1. Develop the null and alternative hypotheses. 
Step 2. Specify the level of significance. 
Step 3. Collect the sample data and compute the value of the test statistic. 


p-Value Approach 


Step 4. Use the value of the test statistic to compute the p-value. 
Step 5. Reject $H_0$ if the p-value $\leq \alpha$.
Step 6. Interpret the statistical conclusion in the context of the application. 


Critical Value Approach 


Step 4. Use the level of significance to determine the critical value and the rejection rule. 

Step 5. Use the value of the test statistic and the rejection rule to determine
whether to reject $H_0$.

Step 6. Interpret the statistical conclusion in the context of the application. 


Practical advice about the sample size for hypothesis tests is similar to the advice we
provided about the sample size for interval estimation in Chapter 8. In most applications, a
sample size of $n \geq 30$ is adequate when using the hypothesis testing procedure described
in this section. In cases where the sample size is less than 30, the distribution of the
population from which we are sampling becomes an important consideration. If the population is
normally distributed, the hypothesis testing procedure that we described is exact and can be
used for any sample size. If the population is not normally distributed but is at least roughly
symmetric, sample sizes as small as 15 can be expected to provide acceptable results.


Relationship Between Interval Estimation 
and Hypothesis Testing 


In Chapter 8 we showed how to develop a confidence interval estimate of a population mean. 
For the σ known case, the 100(1 − α)% confidence interval estimate of a population mean is
given by

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

In this chapter we showed that a two-tailed hypothesis test about a population mean
takes the following form:

H₀: μ = μ₀
Hₐ: μ ≠ μ₀

where μ₀ is the hypothesized value for the population mean.

Suppose that we follow the procedure described in Chapter 8 for constructing a
100(1 − α)% confidence interval for the population mean. We know that 100(1 − α)%
of the confidence intervals generated will contain the population mean and 100α% of
the confidence intervals generated will not contain the population mean. Thus, if we
reject H₀ whenever the confidence interval does not contain μ₀, we will be rejecting the
null hypothesis when it is true (μ = μ₀) with probability α. Recall that the level of
significance is the probability of rejecting the null hypothesis when it is true. So
constructing a 100(1 − α)% confidence interval and rejecting H₀ whenever the interval does not
contain μ₀ is equivalent to conducting a two-tailed hypothesis test with α as the level
of significance. The procedure for using a confidence interval to conduct a two-tailed
hypothesis test can now be summarized.


A CONFIDENCE INTERVAL APPROACH TO TESTING A HYPOTHESIS OF THE FORM 


H₀: μ = μ₀
Hₐ: μ ≠ μ₀

1. Select a simple random sample from the population and use the value of the
sample mean x̄ to develop the confidence interval for the population mean μ.

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

2. If the confidence interval contains the hypothesized value μ₀, do not reject H₀.
Otherwise, reject H₀.³

For a two-tailed hypothesis test, the null hypothesis can be rejected if the confidence
interval does not include μ₀.


³ To be consistent with the rule for rejecting H₀ when the p-value ≤ α, we would also reject H₀ using the confidence
interval approach if μ₀ happens to be equal to one of the endpoints of the 100(1 − α)% confidence interval.

Let us illustrate by conducting the MaxFlight hypothesis test using the confidence
interval approach. The MaxFlight hypothesis test takes the following form:

H₀: μ = 295
Hₐ: μ ≠ 295

To test these hypotheses with a level of significance of α = .05, we sampled 50 golf balls
and found a sample mean distance of x̄ = 297.6 yards. Recall that the population standard
deviation is σ = 12. Using these results with z_{.025} = 1.96, we find that the 95% confidence
interval estimate of the population mean is

$$\bar{x} \pm z_{.025} \frac{\sigma}{\sqrt{n}}$$

$$297.6 \pm 1.96 \frac{12}{\sqrt{50}}$$

$$297.6 \pm 3.3$$

or

294.3 to 300.9

This finding enables the quality control manager to conclude with 95% confidence
that the mean distance for the population of golf balls is between 294.3 and 300.9 yards.
Because the hypothesized value for the population mean, μ₀ = 295, is in this interval,
the hypothesis testing conclusion is that the null hypothesis, H₀: μ = 295, cannot
be rejected.
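
The MaxFlight interval and the resulting decision can be reproduced with a short script. This is a sketch under stated assumptions: it uses Python's standard library instead of Excel, the helper name ci_test is hypothetical, and it treats an endpoint equal to μ₀ as a rejection so that it agrees with the p-value ≤ α rule.

```python
from math import sqrt
from statistics import NormalDist


def ci_test(x_bar, mu_0, sigma, n, alpha):
    """Confidence interval approach to a two-tailed test (illustrative helper)."""
    z_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z_half_alpha * sigma / sqrt(n)
    lower, upper = x_bar - margin, x_bar + margin
    # Reject H0 when mu_0 is outside the interval or exactly on an endpoint,
    # which matches the rule "reject if p-value <= alpha".
    reject = not (lower < mu_0 < upper)
    return lower, upper, reject


lower, upper, reject = ci_test(x_bar=297.6, mu_0=295, sigma=12, n=50, alpha=0.05)
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
print("Reject H0" if reject else "Do not reject H0")
```

Because the script uses the unrounded value of z_{.025} rather than 1.96, the endpoints may differ from hand calculations in later decimal places.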

Note that this discussion and example pertain to two-tailed hypothesis tests about a 
population mean. However, the same confidence interval and two-tailed hypothesis testing 
relationship exists for other population parameters. The relationship can also be extended 
to one-tailed tests about population parameters. Doing so, however, requires the develop- 
ment of one-sided confidence intervals, which are rarely used in practice. 


NOTES + COMMENTS 


1. We have shown how to use p-values. The smaller the p-value, the greater the evidence
against H₀ and the more the evidence in favor of Hₐ. Here are some guidelines
statisticians suggest for interpreting small p-values.

• Less than .01: overwhelming evidence to conclude that Hₐ is true
• Between .01 and .05: strong evidence to conclude that Hₐ is true
• Between .05 and .10: weak evidence to conclude that Hₐ is true
• Greater than .10: insufficient evidence to conclude that Hₐ is true

2. When testing a hypothesis of the population mean with a sample size that is at least
5% of the population size (that is, n/N ≥ .05), the finite population correction factor
should be used when calculating the standard error of the sampling distribution of x̄
when σ is known, that is,

$$\sigma_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{\sigma}{\sqrt{n}} \right)$$
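
To see how much the finite population correction factor changes the standard error, the formula in note 2 can be evaluated directly. The numbers below are hypothetical and chosen only for illustration (a sample of n = 50 from a population of N = 500 with σ = 12); the function name is likewise an assumption.

```python
from math import sqrt


def standard_error_fpc(sigma, n, N):
    """Standard error of x-bar with the finite population correction factor."""
    return sqrt((N - n) / (N - 1)) * (sigma / sqrt(n))


# Hypothetical values: n = 50 sampled from N = 500, so n/N = .10 >= .05.
sigma, n, N = 12, 50, 500
se_uncorrected = sigma / sqrt(n)
se_corrected = standard_error_fpc(sigma, n, N)
print(f"without correction: {se_uncorrected:.4f}")
print(f"with correction:    {se_corrected:.4f}")
```

The corrected standard error is smaller, so the test statistic computed with it is larger in absolute value than the uncorrected one.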


EXERCISES 


Note to Student: Some of the exercises that follow ask you to use the p-value approach 
and others ask you to use the critical value approach. Both methods will provide the 
same hypothesis testing conclusion. We provide exercises with both methods to give you 
practice using both. In later sections and in following chapters, we will generally empha- 
size the p-value approach as the preferred method, but you may select either based on 
personal preference. 


Methods 
9. Consider the following hypothesis test:

H₀: μ ≥ 20
Hₐ: μ < 20

A sample of 50 provided a sample mean of 19.4. The population standard deviation is 2.
a. Compute the value of the test statistic.
b. What is the p-value?
c. Using α = .05, what is your conclusion?
d. What is the rejection rule using the critical value? What is your conclusion?

10. Consider the following hypothesis test:

H₀: μ ≤ 25
Hₐ: μ > 25

A sample of 40 provided a sample mean of 26.4. The population standard deviation is 6.
a. Compute the value of the test statistic.
b. What is the p-value?
c. At α = .01, what is your conclusion?
d. What is the rejection rule using the critical value? What is your conclusion?

11. Consider the following hypothesis test:

H₀: μ = 15
Hₐ: μ ≠ 15

A sample of 50 provided a sample mean of 14.15. The population standard deviation
is 3.
a. Compute the value of the test statistic.
b. What is the p-value?
c. At α = .05, what is your conclusion?
d. What is the rejection rule using the critical value? What is your conclusion?

12. Consider the following hypothesis test:

H₀: μ ≥ 80
Hₐ: μ < 80

A sample of 100 is used and the population standard deviation is 12. Compute the
p-value and state your conclusion for each of the following sample results. Use
α = .01.
a. x̄ = 78.5
b. x̄ = 77
c. x̄ = 75.5
d. x̄ = 81

13. Consider the following hypothesis test:

H₀: μ ≤ 50
Hₐ: μ > 50

A sample of 60 is used and the population standard deviation is 8. Use the critical
value approach to state your conclusion for each of the following sample results. Use
α = .05.
a. x̄ = 52.5
b. x̄ = 51
c. x̄ = 51.8

14. Consider the following hypothesis test:

H₀: μ = 22
Hₐ: μ ≠ 22

A sample of 75 is used and the population standard deviation is 10. Compute the
p-value and state your conclusion for each of the following sample results. Use
α = .01.


Applications 

15. Federal Tax Returns. According to the IRS, individuals filing federal income tax 
returns prior to March 31 received an average refund of $1056 in 2018. Consider the 
population of “last-minute” filers who mail their tax return during the last five days of 

the income tax period (typically April 10 to April 15). 

a. A researcher suggests that a reason individuals wait until the last five days is that on
average these individuals receive lower refunds than do early filers. Develop appropriate
hypotheses such that rejection of H₀ will support the researcher’s contention.

b. For a sample of 400 individuals who filed a tax return between April 10 and 15, the
sample mean refund was $910. Based on prior experience, a population standard
deviation of σ = $1600 may be assumed. What is the p-value?

c. At α = .05, what is your conclusion?

d. Repeat the preceding hypothesis test using the critical value approach. 

16. Credit Card Use by Undergraduates. In a study entitled How Undergraduate Stu- 
dents Use Credit Cards, Sallie Mae reported that undergraduate students have a mean 
credit card balance of $3173. This figure was an all-time high and had increased 44% 
over the previous five years. Assume that a current study is being conducted to deter- 
mine whether it can be concluded that the mean credit card balance for undergraduate 
students has continued to increase compared to the April 2009 report. Based on previous
studies, use a population standard deviation σ = $1000.

a. State the null and alternative hypotheses. 

b. What is the p-value for a sample of 180 undergraduate students with a sample mean 
credit card balance of $3325? 

c. Using a .05 level of significance, what is your conclusion? 

17. Use of Texting. TextRequest reports that adults 18-24 years old send and receive
128 texts every day. Suppose we take a sample of 25-34 year olds to see if their
mean number of daily texts differs from the mean for 18-24 year olds reported by
TextRequest.

a. State the null and alternative hypotheses we should use to test whether the population
mean daily number of texts for 25-34 year olds differs from the population
daily mean number of texts for 18-24 year olds.

b. Suppose a sample of thirty 25-34 year olds showed a sample mean of 118.6 texts
per day. Assume a population standard deviation of 33.17 texts per day and compute
the p-value.

c. With α = .05 as the level of significance, what is your conclusion?

d. Repeat the preceding hypothesis test using the critical value approach. 

18. CPA Work Hours. The American Institute of Certified Tax Planners reports that the 
average U.S. CPA works 60 hours per week during tax season. Do CPAs in states that 
have flat state income tax rates work fewer hours per week during tax season? Conduct 
a hypothesis test to determine if this is so. 

a. Formulate hypotheses that can be used to determine whether the mean hours
worked per week during tax season by CPAs in states that have flat state income
tax rates is less than the mean hours worked per week by all U.S. CPAs during tax
season.

b. Based on a sample, the mean number of hours worked per week during tax season
by CPAs in states with flat tax rates was 55. Assume the sample size was 150 and
that, based on past studies, the population standard deviation can be assumed to be
σ = 27.4. Use the sample results to compute the test statistic and p-value for your
hypothesis test.

c. At α = .05, what is your conclusion?

19. Length of Calls to the IRS. According to the IRS, taxpayers calling the IRS in 2017 
waited 13 minutes on average for an IRS telephone assister to answer. Do callers 
who use the IRS help line early in the day have a shorter wait? Suppose a sample of 
50 callers who placed their calls to the IRS in the first 30 minutes that the line is open 
during the day have a mean waiting time of 11 minutes before an IRS telephone assister 
answers. Based on data from past years, you decide that it is reasonable to assume that 
the standard deviation of waiting times is 8 minutes. Using these sample results, can 
you conclude that the waiting time for calls placed during the first 30 minutes the IRS 
help line is open each day is significantly less than the overall mean waiting time of 
13 minutes? Use α = .05.

20. Prescription Drug Costs. According to the Hospital Care Cost Institute, the annual 
expenditure for prescription drugs is $838 per person in the Northeast region of the 
country. A sample of 60 individuals in the Midwest shows a per person annual
expenditure for prescription drugs of $745. Use a population standard deviation of $300 to
answer the following questions.

a. Formulate hypotheses for a test to determine whether the sample data support the
conclusion that the population annual expenditure for prescription drugs per person
is lower in the Midwest than in the Northeast.

b. What is the value of the test statistic?

c. What is the p-value?

d. At α = .01, what is your conclusion?

21. Cost of Telephone Surveys. Fowle Marketing Research, Inc. bases charges to a client on
the assumption that telephone surveys can be completed in a mean time of 15 minutes or
less. If a longer mean survey time is necessary, a premium rate is charged. A sample of
35 surveys provided the survey times shown in the file Fowle. Based upon past studies,
the population standard deviation is assumed known with σ = 4 minutes. Is the premium
rate justified?

a. Formulate the null and alternative hypotheses for this application.

b. Compute the value of the test statistic.

c. What is the p-value?

d. At α = .01, what is your conclusion?

22. Time in Supermarket Checkout Lines. CCN and ActMedia provided a television 
channel targeted to individuals waiting in supermarket checkout lines. The channel 
showed news, short features, and advertisements. The length of the program was based 
on the assumption that the population mean time a shopper stands in a supermarket 
checkout line is 8 minutes. A sample of actual waiting times will be used to test this 
assumption and determine whether actual mean waiting time differs from this standard. 
a. Formulate the hypotheses for this application. 

b. A sample of 120 shoppers showed a sample mean waiting time of 8.4 minutes.
Assume a population standard deviation of σ = 3.2 minutes. What is the p-value?

c. At α = .05, what is your conclusion?

d. Compute a 95% confidence interval for the population mean. Does it support your 
conclusion? 




9.4 Population Mean: σ Unknown


In this section we describe how to conduct hypothesis tests about a population mean for
the σ unknown case. Because the σ unknown case corresponds to situations in which an
estimate of the population standard deviation cannot be developed prior to sampling, the
sample must be used to develop an estimate of both μ and σ. Thus, to conduct a hypothesis
test about a population mean for the σ unknown case, the sample mean x̄ is used as an
estimate of μ and the sample standard deviation s is used as an estimate of σ.

The steps of the hypothesis testing procedure for the σ unknown case are the same as
those for the σ known case described in Section 9.3. But, with σ unknown, the computation
of the test statistic and p-value is a bit different. Recall that for the σ known case,
the sampling distribution of the test statistic has a standard normal distribution. For the σ
unknown case, however, the sampling distribution of the test statistic follows the t distribution;
it has slightly more variability because the sample is used to develop estimates of both
μ and σ.

In Section 8.2 we showed that an interval estimate of a population mean for the σ
unknown case is based on a probability distribution known as the t distribution. Hypothesis
tests about a population mean for the σ unknown case are also based on the t distribution.
For the σ unknown case, the test statistic has a t distribution with n − 1 degrees of freedom.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION MEAN:
σ UNKNOWN

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \tag{9.2}$$

In Chapter 8 we said that the t distribution is based on an assumption that the population
from which we are sampling has a normal distribution. However, research shows that this
assumption can be relaxed considerably when the sample size is large enough. We provide
some practical advice concerning the population distribution and sample size at the end of
the section.


One-Tailed Test 


Let us consider an example of a one-tailed test about a population mean for the σ unknown
case. A business travel magazine wants to classify transatlantic gateway airports according
to the mean rating for the population of business travelers. A rating scale with a low
score of 0 and a high score of 10 will be used, and airports with a population mean rating
greater than 7 will be designated as superior service airports. The magazine staff surveyed
a sample of 60 business travelers at each airport to obtain the ratings data. The sample
for London’s Heathrow Airport provided a sample mean rating of x̄ = 7.25 and a sample
standard deviation of s = 1.052. Do the data indicate that Heathrow should be designated
as a superior service airport?

DATA file: AirRating

We want to develop a hypothesis test for which the decision to reject H₀ will lead to
the conclusion that the population mean rating for the Heathrow Airport is greater than 7.
Thus, an upper tail test with Hₐ: μ > 7 is required. The null and alternative hypotheses for
this upper tail test are as follows:

H₀: μ ≤ 7
Hₐ: μ > 7

We will use α = .05 as the level of significance for the test.

Using equation (9.2) with x̄ = 7.25, μ₀ = 7, s = 1.052, and n = 60, the value of the test
statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = 1.84$$

The sampling distribution of t has n − 1 = 60 − 1 = 59 degrees of freedom. Because the
test is an upper tail test, the p-value is P(t ≥ 1.84), that is, the upper tail area corresponding
to the value of the test statistic.

The t distribution table provided in most textbooks will not contain sufficient detail to
determine the exact p-value, such as the p-value corresponding to t = 1.84. For instance,
using Table 2 in Appendix B (available online), the t distribution with 59 degrees of
freedom provides the following information.

Area in Upper Tail    .20     .10     .05     .025    .01     .005
t Value (59 df)       .848    1.296   1.671   2.001   2.391   2.662

t = 1.84 falls between the t values 1.671 and 2.001.

We see that t = 1.84 is between 1.671 and 2.001. Although the table does not provide the
exact p-value, the values in the “Area in Upper Tail” row show that the p-value must be
less than .05 and greater than .025. With a level of significance of α = .05, this placement
is all we need to know to make the decision to reject the null hypothesis and conclude that
Heathrow should be classified as a superior service airport.

It is cumbersome to use a t table to compute p-values, and only approximate values are
obtained. We describe how to compute exact p-values using Excel’s T.DIST function in the
Using Excel subsection which follows. The exact upper tail p-value for the Heathrow Airport
hypothesis test is .0354. With .0354 < .05, we reject the null hypothesis and conclude
that Heathrow should be classified as a superior service airport.

The decision whether to reject the null hypothesis in the σ unknown case can also be made
using the critical value approach. The critical value corresponding to an area of α = .05 in
the upper tail of a t distribution with 59 degrees of freedom is t_{.05} = 1.671. Thus the rejection
rule using the critical value approach is to reject H₀ if t ≥ 1.671. Because t = 1.84 > 1.671,
H₀ is rejected. Heathrow should be classified as a superior service airport.
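
The Heathrow upper tail p-value can also be checked outside of Excel. The sketch below is an illustration under stated assumptions, not the text's method: Python's standard library has no t distribution, so the hypothetical helper t_upper_tail integrates the t density numerically with Simpson's rule. In practice a statistics library would normally supply this function.

```python
from math import exp, lgamma, log, log1p, pi, sqrt


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T >= t), integrating the density over [0, |t|] with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 - area if t >= 0 else 0.5 + area


# Heathrow example: H0: mu <= 7 versus Ha: mu > 7, alpha = .05.
x_bar, mu_0, s, n, alpha = 7.25, 7, 1.052, 60, 0.05
t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, upper tail p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```

The computed p-value uses the unrounded test statistic, so it should agree with the exact value reported in the text rather than with the bounds read from the t table.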


Two-Tailed Test 


To illustrate how to conduct a two-tailed test about a population mean for the σ unknown
case, let us consider the hypothesis testing situation facing Holiday Toys. The company
manufactures and distributes its products through more than 1000 retail outlets. In planning
production levels for the coming winter season, Holiday must decide how many
units of each product to produce prior to knowing the actual demand at the retail level.
For this year’s most important new toy, Holiday’s marketing director is expecting demand
to average 40 units per retail outlet. Prior to making the final production decision based
upon this estimate, Holiday decided to survey a sample of 25 retailers in order to develop
more information about the demand for the new product. Each retailer was provided with
information about the features of the new toy along with the cost and the suggested selling
price. Then each retailer was asked to specify an anticipated order quantity.

With μ denoting the population mean order quantity per retail outlet, the sample data
will be used to conduct the following two-tailed hypothesis test:

H₀: μ = 40
Hₐ: μ ≠ 40

If H₀ cannot be rejected, Holiday will continue its production planning based on the
marketing director’s estimate that the population mean order quantity per retail outlet
will be μ = 40 units. However, if H₀ is rejected, Holiday will immediately reevaluate its
production plan for the product. A two-tailed hypothesis test is used because Holiday wants
to reevaluate the production plan if the population mean quantity per retail outlet is less
than anticipated or greater than anticipated. Because no historical data are available (it’s a
new product), the population mean μ and the population standard deviation must both be
estimated using x̄ and s from the sample data.
The sample of 25 retailers provided a mean of x̄ = 37.4 and a standard deviation of
s = 11.79 units. Before going ahead with the use of the t distribution, the analyst constructed
a histogram of the sample data in order to check on the form of the population distribution.
The histogram of the sample data showed no evidence of skewness or any extreme outliers,
so the analyst concluded that the use of the t distribution with n − 1 = 24 degrees of
freedom was appropriate. Using equation (9.2) with x̄ = 37.4, μ₀ = 40, s = 11.79, and n = 25,
the value of the test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = -1.10$$

Because we have a two-tailed test, the p-value is two times the area under the curve
of the t distribution for t ≤ −1.10. Using Table 2 in Appendix B (available online), the t
distribution table for 24 degrees of freedom provides the following information.

Area in Upper Tail    .20     .10     .05     .025    .01     .005
t Value (24 df)       .857    1.318   1.711   2.064   2.492   2.797

t = 1.10 falls between the t values .857 and 1.318.

The t distribution table contains only positive t values (corresponding to areas in the upper
tail). Because the t distribution is symmetric, however, the upper tail area for t = 1.10 is the
same as the lower tail area for t = −1.10. We see that t = 1.10 is between 0.857 and 1.318.
From the “Area in Upper Tail” row, we see that the area in the tail to the right of t = 1.10
is between .20 and .10. When we double these amounts, we see that the p-value must be
between .40 and .20. With a level of significance of α = .05, we now know that the p-value is
greater than α. Therefore, H₀ cannot be rejected. Sufficient evidence is not available to
conclude that Holiday should change its production plan for the coming season.

In the Using Excel subsection which follows, we show how to compute the exact
p-value for this hypothesis test using Excel. The p-value obtained is .2811. With a level of
significance of α = .05, we cannot reject H₀ because .2811 > .05.

The test statistic can also be compared to the critical value to make the two-tailed
hypothesis testing decision. With α = .05 and the t distribution with 24 degrees of
freedom, −t_{.025} = −2.064 and t_{.025} = 2.064 are the critical values for the two-tailed test. The
rejection rule using the test statistic is

Reject H₀ if t ≤ −2.064 or if t ≥ 2.064

Based on the test statistic t = −1.10, H₀ cannot be rejected. This result indicates that
Holiday should continue its production planning for the coming season based on the
expectation that μ = 40.


Using Excel


Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population
mean for the σ unknown case. The approach is similar to the procedure used in the σ
known case. The sample data and the test statistic (t) are used to compute three p-values:
p-value (Lower Tail), p-value (Upper Tail), and p-value (Two Tail). The user can then
choose α and draw a conclusion using whichever p-value is appropriate for the type of
hypothesis test being conducted.

Let’s start by showing how to use Excel’s T.DIST function to compute a lower tail 
p-value. The T.DIST function has three inputs; its general form is as follows: 


T.DIST(test statistic, degrees of freedom, cumulative) 


For the first input, we enter the value of the test statistic; for the second input, we enter the
number of degrees of freedom. For the third input, we enter TRUE if we want a cumulative
probability and FALSE if we want the height of the curve. When we want to compute a
lower tail p-value, we enter TRUE.

Once the lower tail p-value has been computed, it is easy to compute the upper tail and
the two-tailed p-values. The upper tail p-value is just 1 minus the lower tail p-value. And the
two-tailed p-value is given by two times the smaller of the lower and upper tail p-values.

Let us now construct an Excel worksheet to conduct the two-tailed hypothesis test for 
the Holiday Toys study. Refer to Figure 9.8 as we describe the tasks involved. The formula 
worksheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file Orders. A label and the order quantity data for the sam- 
ple of 25 retailers are entered in cells A1:A26. 


Enter Functions and Formulas: The sample size, sample mean, and sample standard 
deviation are computed in cells D4:D6 using Excel’s COUNT, AVERAGE, and STDEV.S 
functions, respectively. The value worksheet shows that the sample size is 25, the sample 
mean is 37.4, and the sample standard deviation is 11.79. The hypothesized value of the 
population mean (40) is entered in cell D8. 


Using the sample standard deviation as an estimate of the population standard 
deviation, an estimate of the standard error is obtained in cell D10 by dividing the 
sample standard deviation in cell D6 by the square root of the sample size in cell 
D4. The formula =(D5-D8)/D10 entered in cell D11 computes the test statistic 
t (−1.1026). The degrees of freedom are computed in cell D12 as the sample size in
cell D4 minus 1. 

To compute the p-value for a lower tail test, we enter the following formula in cell D14: 


=T.DIST(D11,D12,TRUE)


FIGURE 9.8  Excel Worksheet: Hypothesis Test for the σ Unknown Case

Title (rows 1-2 of both worksheets): Hypothesis Test about a Population Mean: σ Unknown Case
Column A holds the label Units in cell A1 and the 25 order quantities in cells A2:A26.

Formula worksheet

Row   Label                        Column D
4     Sample Size                  =COUNT(A2:A26)
5     Sample Mean                  =AVERAGE(A2:A26)
6     Sample Standard Deviation    =STDEV.S(A2:A26)
8     Hypothesized Value           40
10    Standard Error               =D6/SQRT(D4)
11    Test Statistic t             =(D5-D8)/D10
12    Degrees of Freedom           =D4-1
14    p-value (Lower Tail)         =T.DIST(D11,D12,TRUE)
15    p-value (Upper Tail)         =1-D14
16    p-value (Two Tail)           =2*MIN(D14,D15)

Value worksheet

Row   Label                        Column D
4     Sample Size                  25
5     Sample Mean                  37.4
6     Sample Standard Deviation    11.79
8     Hypothesized Value           40
10    Standard Error               2.358
11    Test Statistic t             −1.1026
12    Degrees of Freedom           24
14    p-value (Lower Tail)         0.1406
15    p-value (Upper Tail)         0.8594
16    p-value (Two Tail)           0.2811

Note: Rows 18-24 are hidden.


The p-value for an upper tail test is then computed in cell D15 as 1 minus the p-value 
for the lower tail test. Finally, the p-value for a two-tailed test is computed in cell D16 as 
two times the minimum of the two one-tailed p-values. The value worksheet shows that 
the three p-values are p-value (Lower Tail) = 0.1406, p-value (Upper Tail) = 0.8594, and 
p-value (Two Tail) = 0.2811. 

The development of the worksheet is now complete. For the two-tailed Holiday
Toys problem we cannot reject H₀: μ = 40 using α = .05 because the p-value (Two
Tail) = 0.2811 is greater than α. This result indicates that Holiday should continue its
production planning for the coming season based on the expectation that μ = 40. The
worksheet in Figure 9.8 can also be used for any one-tailed hypothesis test involving the
t distribution. If a lower tail test is required, compare the p-value (Lower Tail) with α to
make the rejection decision. If an upper tail test is required, compare the p-value (Upper
Tail) with α to make the rejection decision.
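
For readers who want to cross-check the worksheet outside of Excel, the same cell-by-cell logic can be sketched in Python. This is an illustrative assumption-laden sketch: it starts from the summary statistics shown in the value worksheet rather than the raw data in the file Orders, and the helper t_cdf is a hypothetical stand-in for T.DIST(..., TRUE) that integrates the t density with Simpson's rule, since the standard library has no t distribution.

```python
from math import exp, lgamma, log, log1p, pi, sqrt


def t_cdf(t, df, steps=2000):
    """P(T <= t) for the t distribution, via Simpson's rule on [0, |t|]."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)

    def pdf(x):
        return exp(log_const - (df + 1) / 2 * log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


# Summary statistics from the value worksheet (cells D4, D5, D6, D8).
n, x_bar, s, mu_0 = 25, 37.4, 11.79, 40

std_error = s / sqrt(n)                  # cell D10
t_stat = (x_bar - mu_0) / std_error      # cell D11
df = n - 1                               # cell D12
p_lower = t_cdf(t_stat, df)              # cell D14
p_upper = 1 - p_lower                    # cell D15
p_two_tail = 2 * min(p_lower, p_upper)   # cell D16
print(f"t = {t_stat:.4f}")
print(f"p-values: lower {p_lower:.4f}, upper {p_upper:.4f}, two-tail {p_two_tail:.4f}")
```

Because s = 11.79 is a rounded value, the results may differ from the worksheet in the last displayed digit.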


A Template for Other Problems The worksheet in Figure 9.8 can be used as a template
for any hypothesis tests about a population mean for the σ unknown case. Just enter the
appropriate data in column A, adjust the ranges for the formulas in cells D4:D6, and enter the
hypothesized value in cell D8. The standard error, the test statistic, and the three p-values
will then appear. Depending on the form of the hypothesis test (lower tail, upper tail, or
two-tailed), we can then choose the appropriate p-value to make the rejection decision.

We can further simplify the use of Figure 9.8 as a template for other problems by 
eliminating the need to enter new data ranges in cells D4:D6. To do so we rewrite the cell 
formulas as follows: 


Cell D4: =COUNT(A:A) 
Cell D5: =AVERAGE(A:A) 
Cell D6: =STDEV(A:A) 


With the A:A method of specifying data ranges, Excel’s COUNT function will count the
number of numeric values in column A, Excel’s AVERAGE function will compute the
average of the numeric values in column A, and Excel’s STDEV function will compute
the standard deviation of the numeric values in column A. Thus, to solve a new problem it
is only necessary to enter the new data in column A and enter the hypothesized value of the
population mean in cell D8.

The file Orders includes a worksheet entitled Template that uses the A:A method for
entering the data ranges.


TABLE 9.3  Summary of Hypothesis Tests About a Population Mean: σ Unknown Case

                          Lower Tail Test         Upper Tail Test         Two-Tailed Test
Hypotheses                H₀: μ ≥ μ₀              H₀: μ ≤ μ₀              H₀: μ = μ₀
                          Hₐ: μ < μ₀              Hₐ: μ > μ₀              Hₐ: μ ≠ μ₀
Test Statistic            t = (x̄ − μ₀)/(s/√n)     t = (x̄ − μ₀)/(s/√n)     t = (x̄ − μ₀)/(s/√n)
Rejection Rule:           Reject H₀ if            Reject H₀ if            Reject H₀ if
p-Value Approach          p-value ≤ α             p-value ≤ α             p-value ≤ α
Rejection Rule:           Reject H₀ if            Reject H₀ if            Reject H₀ if
Critical Value            t ≤ −t_{α}              t ≥ t_{α}               t ≤ −t_{α/2}
Approach                                                                  or if t ≥ t_{α/2}

Summary and Practical Advice 


Table 9.3 provides a summary of the hypothesis testing procedures about a population
mean for the σ unknown case. The key difference between these procedures and the ones
for the σ known case is that s is used, instead of σ, in the computation of the test statistic.
For this reason, the test statistic follows the t distribution.

The applicability of the hypothesis testing procedures of this section is dependent on the
distribution of the population being sampled from and the sample size. When the population
is normally distributed, the hypothesis tests described in this section provide exact
results for any sample size. When the population is not normally distributed, the procedures
are approximations. Nonetheless, we find that sample sizes of 30 or greater will provide
good results in most cases. If the population is approximately normal, small sample sizes
(e.g., n < 15) can provide acceptable results. If the population is highly skewed or contains
outliers, sample sizes approaching 50 are recommended.


NOTE + COMMENT 


When testing a hypothesis of the population mean with a sample size that is at least 5% of
the population size (that is, n/N ≥ .05), the finite population correction factor should
be used when calculating the standard error of the sampling distribution of x̄ when σ is
unknown, i.e.,

$$s_{\bar{x}} = \sqrt{\frac{N - n}{N - 1}} \left( \frac{s}{\sqrt{n}} \right)$$


Methods 

23. Consider the following hypothesis test: 
Ay: u = 12 
H; u > 12 
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A sample of 25 provided a sample mean x = 14 and a sample standard deviation 
s = 4.32. 
a. Compute the value of the test statistic. 
b. Use the ¢ distribution table (Table 2 in online Appendix B) to compute a range for 
the p-value. 
c. Ata = .05, what is your conclusion? 
d. What is the rejection rule using the critical value? What is your conclusion? 
24. Consider the following hypothesis test:

H₀: μ = 18
Hₐ: μ ≠ 18

A sample of 48 provided a sample mean x̄ = 17 and a sample standard deviation
s = 4.5.
a. Compute the value of the test statistic.
b. Use the t distribution table (Table 2 in online Appendix B) to compute a range for
the p-value.
c. At α = .05, what is your conclusion?
d. What is the rejection rule using the critical value? What is your conclusion?
25. Consider the following hypothesis test:

H₀: μ ≥ 45
Hₐ: μ < 45

A sample of 36 is used. Identify the p-value and state your conclusion for each of the
following sample results. Use α = .01.
a. x̄ = 44 and s = 5.2
b. x̄ = 43 and s = 4.6
c. x̄ = 46 and s = 5.0
26. Consider the following hypothesis test:

H₀: μ = 100
Hₐ: μ ≠ 100

A sample of 65 is used. Identify the p-value and state your conclusion for each of the
following sample results. Use α = .05.
a. x̄ = 103 and s = 11.5
b. x̄ = 96.5 and s = 11.0
c. x̄ = 102 and s = 10.5


Applications 

27. Price of Good Red Wine. According to the Vivino website, the mean price for a bottle 
of red wine that scores 4.0 or higher on the Vivino Rating System is $32.48. A New 
England-based lifestyle magazine wants to determine if red wines of the same quality 
are less expensive in Providence, and it has collected prices for 56 randomly selected 
red wines of similar quality from wine stores throughout Providence. The mean and 
standard deviation for this sample are $30.15 and $12, respectively. 

a. Develop appropriate hypotheses for a test to determine whether the sample data 
support the conclusion that the mean price in Providence for a bottle of red wine 
that scores 4.0 or higher on the Vivino Rating System is less than the population 
mean of $32.48. 

b. Using the sample from the 56 bottles, what is the p-value? 

c. At α = .05, what is your conclusion?

d. Repeat the preceding hypothesis test using the critical value approach. 

28. CEO Tenure. A shareholders’ group, in lodging a protest, claimed that the mean tenure
for a chief executive officer (CEO) was at least nine years. A survey of companies
reported in The Wall Street Journal found a sample mean tenure of x̄ = 7.27 years for
CEOs with a standard deviation of s = 6.38 years.
a. Formulate hypotheses that can be used to challenge the validity of the claim made
by the shareholders’ group.
b. Assume 85 companies were included in the sample. What is the p-value for your
hypothesis test?
c. At α = .01, what is your conclusion?

29. Cost of Residential Water. On its municipal website, the city of Tulsa states that the
rate it charges per 5 CCF of residential water is $21.62. How do the residential water
rates of other U.S. public utilities compare to Tulsa’s rate? The file ResidentialWater
contains the rate per 5 CCF of residential water for 42 randomly selected U.S. cities.

a. Formulate hypotheses that can be used to determine whether the population mean 
rate per 5 CCF of residential water charged by U.S. public utilities differs from the 
$21.62 rate charged by Tulsa. 

b. What is the p-value for your hypothesis test in part (a)? 

c. At α = .05, can your null hypothesis be rejected? What is your conclusion?

d. Repeat the preceding hypothesis test using the critical value approach. 

30. Time in Child Care. The time married men with children spend on child care averages
6.4 hours per week. You belong to a professional group on family practices that
would like to do its own study to determine whether the time married men in your area
spend on child care per week differs from the reported mean of 6.4 hours per week.
A sample of 40 married couples will be used with the data collected showing the
hours per week the husband spends on child care. The sample data are contained in the
file ChildCare.

a. What are the hypotheses if your group would like to determine whether the popula- 
tion mean number of hours married men are spending in child care differs from the 
mean reported by Time in your area? 

b. What is the sample mean and the p-value? 

c. Select your own level of significance. What is your conclusion? 

31. Chocolate Consumption. The United States ranks ninth in the world in per capita 
chocolate consumption; Forbes reports that the average American eats 9.5 pounds of 
chocolate annually. Suppose you are curious whether chocolate consumption is higher 
in Hershey, Pennsylvania, the location of The Hershey Company’s corporate headquar- 
ters. A sample of 36 individuals from the Hershey area showed a sample mean annual 
consumption of 10.05 pounds and a standard deviation of s = 1.5 pounds. Using 
a = .05, do the sample results support the conclusion that mean annual consumption 
of chocolate is higher in Hershey than it is throughout the United States? 

32. Used Car Prices. According to the National Automobile Dealers Association, the
mean price for used cars is $10,192. A manager of a Kansas City used car dealership
reviewed a sample of 50 recent used car sales at the dealership in an attempt to
determine whether the population mean price for used cars at this particular dealership
differed from the national mean. The prices for the sample of 50 cars are shown in the
file UsedCars.

a. Formulate the hypotheses that can be used to determine whether a difference exists 
in the mean price for used cars at the dealership. 

b. What is the p-value? 

c. At α = .05, what is your conclusion?

33. Automobile Insurance Premiums. Insure.com reports that the mean annual premium for 
automobile insurance in the United States was $1365 in 2018. Being from Pennsylvania, 
you believe automobile insurance is cheaper there and wish to develop statistical support 
for your opinion. A sample of 25 automobile insurance policies from the state of Pennsyl- 
vania showed a mean annual premium of $1302 with a standard deviation of s = $165. 

a. Develop a hypothesis test that can be used to determine whether the mean annual 
premium in Pennsylvania is lower than the national mean annual premium. 

b. What is a point estimate of the difference between the mean annual premium in 
Pennsylvania and the national mean? 

c. At α = .05, test for a significant difference. What is your conclusion?


34. Landscaping Labor Costs. Joan’s Nursery specializes in custom-designed landscaping
for residential areas. The estimated labor cost associated with a particular landscaping
proposal is based on the number of plantings of trees, shrubs, and so on to be used
for the project. For cost-estimating purposes, managers use two hours of labor time for
the planting of a medium-sized tree. Actual times from a sample of 10 plantings during
the past month follow (times in hours).


1.7 1.5 2.6 2.2 2.4 2.3 2.6 3.0 1.4 2.3


With a .05 level of significance, test to see whether the mean tree-planting time differs 
from two hours. 

a. State the null and alternative hypotheses. 

b. Compute the sample mean. 

c. Compute the sample standard deviation. 

d. What is the p-value? 

e. What is your conclusion? 


9.5 Population Proportion 


In this section we show how to conduct a hypothesis test about a population proportion p. 
Using p₀ to denote the hypothesized value for the population proportion, the three forms
for a hypothesis test about a population proportion are as follows.

H₀: p ≥ p₀        H₀: p ≤ p₀        H₀: p = p₀
Hₐ: p < p₀        Hₐ: p > p₀        Hₐ: p ≠ p₀

The first form is called a lower tail test, the second form is called an upper tail test, and the 
third form is called a two-tailed test. 

Hypothesis tests about a population proportion are based on the difference between the
sample proportion p̄ and the hypothesized population proportion p₀. The methods used to
conduct the hypothesis test are similar to those used for hypothesis tests about a population 
mean. The only difference is that we use the sample proportion and its standard error to 
compute the test statistic. The p-value approach or the critical value approach is then used 
to determine whether the null hypothesis should be rejected. 

Let us consider an example involving a situation faced by Pine Creek golf course. Over 
the past year, 20% of the players at Pine Creek were women. In an effort to increase the 
proportion of women players, Pine Creek implemented a special promotion designed to 
attract women golfers. One month after the promotion was implemented, the course man- 
ager requested a statistical study to determine whether the proportion of women players at 
Pine Creek had increased. Because the objective of the study is to determine whether the 
proportion of women golfers increased, an upper tail test with Hₐ: p > .20 is appropriate.
The null and alternative hypotheses for the Pine Creek hypothesis test are as follows:

H₀: p ≤ .20
Hₐ: p > .20

If H₀ can be rejected, the test results will give statistical support for the conclusion that the
proportion of women golfers increased and the promotion was beneficial. The course manager
specified that a level of significance of α = .05 be used in carrying out this hypothesis test.

The next step of the hypothesis testing procedure is to select a sample and compute 
the value of an appropriate test statistic. To show how this step is done for the Pine Creek upper 
tail test, we begin with a general discussion of how to compute the value of the test statistic for 
any form of a hypothesis test about a population proportion. The sampling distribution of p̄, the
point estimator of the population parameter p, is the basis for developing the test statistic.

When the null hypothesis is true as an equality, the expected value of p̄ equals the
hypothesized value p₀; that is, E(p̄) = p₀. The standard error of p̄ is given by

$$\sigma_{\bar{p}} = \sqrt{\frac{p_0(1 - p_0)}{n}}$$

In Chapter 7 we indicated that if np ≥ 5 and n(1 − p) ≥ 5, the sampling distribution of
p̄ can be approximated by a normal distribution.⁴ Under these conditions, which usually
apply in practice, the quantity

$$z = \frac{\bar{p} - p_0}{\sigma_{\bar{p}}} \qquad (9.3)$$

has a standard normal probability distribution. With $\sigma_{\bar{p}} = \sqrt{p_0(1 - p_0)/n}$, the standard
normal random variable z is the test statistic used to conduct hypothesis tests about a
population proportion.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION

$$z = \frac{\bar{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \qquad (9.4)$$


We can now compute the test statistic for the Pine Creek hypothesis test. Suppose a
random sample of 400 players was selected, and that 100 of the players were women. The
proportion of women golfers in the sample is

$$\bar{p} = \frac{100}{400} = .25$$

Using equation (9.4), the value of the test statistic is

$$z = \frac{\bar{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{.25 - .20}{\sqrt{\dfrac{.20(1 - .20)}{400}}} = \frac{.05}{.02} = 2.50$$


Because the Pine Creek hypothesis test is an upper tail test, the p-value is the probability 
that z is greater than or equal to z = 2.50; that is, it is the upper tail area corresponding to 
z = 2.50. Using the standard normal probability table, we find that the lower tail area for 
z = 2.50 is .9938. Thus, the p-value for the Pine Creek test is 1.0000 — .9938 = .0062. 
Figure 9.9 shows this p-value calculation. 
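The Pine Creek calculation can be checked with a short Python sketch. It uses only the numbers given in the text (p₀ = .20, n = 400, 100 women) and the standard library's `statistics.NormalDist` in place of the standard normal probability table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.20        # hypothesized population proportion
n = 400          # sample size
women = 100      # number of women golfers in the sample

p_bar = women / n                      # sample proportion
std_error = sqrt(p0 * (1 - p0) / n)    # standard error under the null hypothesis
z = (p_bar - p0) / std_error           # test statistic, equation (9.4)

# Upper tail test: the p-value is the area to the right of z.
p_value = 1 - NormalDist().cdf(z)

print(round(z, 2), round(p_value, 4))  # 2.5 0.0062
```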


FIGURE 9.9 Calculation of the p-Value for the Pine Creek Hypothesis Test

[Figure: standard normal curve with area .9938 to the left of z = 2.50 and the upper tail
area p-value = P(z ≥ 2.50) = .0062 to its right.]


⁴In most applications involving hypothesis tests of a population proportion, sample sizes are large enough to use the
normal approximation. The exact sampling distribution of p̄ is discrete, with the probability for each value of p̄ given
by the binomial distribution. So hypothesis testing is a bit more complicated for small samples when the normal
approximation cannot be used.

Recall that the course manager specified a level of significance of α = .05. A p-value =
.0062 < .05 gives sufficient statistical evidence to reject H₀ at the .05 level of significance.
Thus, the test provides statistical support for the conclusion that the special promotion
increased the proportion of women players at the Pine Creek golf course.

The decision whether to reject the null hypothesis can also be made using the critical
value approach. The critical value corresponding to an area of .05 in the upper tail of a normal
probability distribution is $z_{.05}$ = 1.645. Thus, the rejection rule using the critical value
approach is to reject H₀ if z ≥ 1.645. Because z = 2.50 > 1.645, H₀ is rejected.
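The critical value approach can be sketched the same way. This is a minimal illustration, not part of the text: it takes z = 2.50 and α = .05 from the Pine Creek example and uses `NormalDist().inv_cdf` from the Python standard library to look up the upper tail critical value.

```python
from statistics import NormalDist

alpha = 0.05   # level of significance specified by the course manager
z = 2.50       # test statistic from the Pine Creek sample

# Critical value with an area of alpha in the upper tail of the standard normal.
z_crit = NormalDist().inv_cdf(1 - alpha)

# Upper tail rejection rule: reject the null hypothesis if z >= z_crit.
reject = z >= z_crit

print(round(z_crit, 3), reject)
```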

Again, we see that the p-value approach and the critical value approach lead to the same 
hypothesis testing conclusion, but the p-value approach provides more information. With a 
p-value = .0062, the null hypothesis would be rejected for any level of significance greater 
than or equal to .0062. 


Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population 
proportion using the p-value approach. The procedure is similar to the approach used with 
Excel in conducting hypothesis tests about a population mean. The primary difference is 
that the test statistic is based on the sampling distribution of x̄ for hypothesis tests about a
population mean and on the sampling distribution of p̄ for hypothesis tests about a population
proportion. Thus, although different formulas are used to compute the test statistic
needed to make the hypothesis testing decision, the computations of the critical value and 
the p-value for the tests are identical. 

We will illustrate the procedure by showing how Excel can be used to conduct the upper 
tail hypothesis test for the Pine Creek golf course study. Refer to Figure 9.10 as we de- 
scribe the tasks involved. The formula worksheet is in the background; the value worksheet 
is in the foreground. 


Enter/Access Data: Open the file WomenGolf. A label and the gender of each golfer in the
study are entered in cells A1:A401.


Enter Functions and Formulas: The sample size, response count, and sample proportion are 
calculated in cells D3, D5, and D6. Because the data are not numeric, Excel’s COUNTA 
function, not the COUNT function, is used in cell D3 to determine the sample size. 


FIGURE 9.10 Excel Worksheet: Hypothesis Test for Pine Creek Golf Course

[Figure: formula worksheet (background) and value worksheet (foreground), titled
"Hypothesis Test about a Population Proportion". Column A holds the label Golfer in cell A1
and the gender of each golfer (Female or Male) in cells A2:A401. Column D contains the
following entries.]

Cell   Label                   Formula                     Value
D3     Sample Size             =COUNTA(A2:A401)            400
D4     Response of Interest    Female                      Female
D5     Count for Response      =COUNTIF(A2:A401,D4)        100
D6     Sample Proportion       =D5/D3                      0.25
D8     Hypothesized Value      0.20                        0.20
D10    Standard Error          =SQRT(D8*(1-D8)/D3)         0.02
D11    Test Statistic z        =(D6-D8)/D10                2.5000
D13    p-value (Lower Tail)    =NORM.S.DIST(D11,TRUE)      0.9938
D14    p-value (Upper Tail)    =1-D13                      0.0062
D15    p-value (Two Tail)      =2*MIN(D13,D14)             0.0124

Note: Rows 17-399 are hidden.



We entered Female in cell D4 to identify the response for which we wish to compute a pro- 
portion. The COUNTIF function is then used in cell D5 to determine the number of responses 
of the type identified in cell D4. The sample proportion is then computed in cell D6 by divid- 
ing the response count by the sample size. 


The hypothesized value of the population proportion (.20) is entered in cell D8. The 
standard error is obtained in cell D10 by entering the formula =SQRT(D8*(1-D8)/D3). The 
formula =(D6-D8)/D10 entered in cell D11 computes the test statistic z (2.50). To compute
the p-value for a lower tail test, we enter the formula =NORM.S.DIST(D11,TRUE) in cell 
D13. The p-value for an upper tail test is then computed in cell D14 as 1 minus the p-value 
for the lower tail test. Finally, the p-value for a two-tailed test is computed in cell D15 as two 
times the minimum of the two one-tailed p-values. The value worksheet shows that the three 
p-values are as follows: p-value (Lower Tail) = 0.9938; p-value (Upper Tail) = 0.0062; and 
p-value (Two Tail) = 0.0124. 

The development of the worksheet is now complete. For the Pine Creek upper tail 
hypothesis test, we reject the null hypothesis that the population proportion is .20 or less 
because the p-value (Upper Tail) = 0.0062 is less than α = .05. Indeed, with this p-value
we would reject the null hypothesis for any level of significance of .0062 or greater. 
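For readers who want to verify the worksheet outside Excel, the same cell-by-cell logic can be sketched in Python. This is an illustration under stated assumptions: the file WomenGolf is not read here, so the list `responses` is a stand-in built from the counts given in the text (100 Female, 300 Male), and the variable names are hypothetical. `len` plays the role of COUNTA, `list.count` the role of COUNTIF, and `NormalDist().cdf` the role of NORM.S.DIST with its cumulative argument set to TRUE.

```python
from math import sqrt
from statistics import NormalDist

# Stand-in for cells A2:A401 (assumption: only the counts matter, not the order).
responses = ["Female"] * 100 + ["Male"] * 300

sample_size = len(responses)                      # D3: =COUNTA(A2:A401)
response_of_interest = "Female"                   # D4
count = responses.count(response_of_interest)     # D5: =COUNTIF(A2:A401,D4)
sample_proportion = count / sample_size           # D6: =D5/D3

hypothesized_value = 0.20                         # D8
standard_error = sqrt(hypothesized_value * (1 - hypothesized_value) / sample_size)  # D10
z = (sample_proportion - hypothesized_value) / standard_error                       # D11

p_lower = NormalDist().cdf(z)                     # D13: =NORM.S.DIST(D11,TRUE)
p_upper = 1 - p_lower                             # D14: =1-D13
p_two_tail = 2 * min(p_lower, p_upper)            # D15: =2*MIN(D13,D14)

print(round(p_lower, 4), round(p_upper, 4), round(p_two_tail, 4))  # 0.9938 0.0062 0.0124
```

As with the worksheet, all three p-values are computed and the one matching the form of the test is used for the rejection decision.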


A Template for Other Problems The worksheet in Figure 9.10 can be used as a template 
for hypothesis tests about a population proportion whenever np ≥ 5 and n(1 − p) ≥ 5. Just
enter the appropriate data in column A, adjust the ranges for the formulas in cells D3 and 
D5, enter the appropriate response in cell D4, and enter the hypothesized value in cell D8. 
The standard error, the test statistic, and the three p-values will then appear. Depending on 
the form of the hypothesis test (lower tail, upper tail, or two-tailed), we can then choose the 
appropriate p-value to make the rejection decision. 


Summary 


The procedure used to conduct a hypothesis test about a population proportion is similar
to the procedure used to conduct a hypothesis test about a population mean. Although
we only illustrated how to conduct a hypothesis test about a population proportion for an
upper tail test, similar procedures can be used for lower tail and two-tailed tests. Table 9.4
provides a summary of the hypothesis tests about a population proportion. We assume that
np ≥ 5 and n(1 − p) ≥ 5; thus the normal probability distribution can be used to approximate
the sampling distribution of p̄.


TABLE 9.4 Summary of Hypothesis Tests About a Population Proportion

                          Lower Tail Test     Upper Tail Test     Two-Tailed Test
Hypotheses                H₀: p ≥ p₀          H₀: p ≤ p₀          H₀: p = p₀
                          Hₐ: p < p₀          Hₐ: p > p₀          Hₐ: p ≠ p₀
Test Statistic            z = (p̄ − p₀) / √(p₀(1 − p₀)/n) for all three forms
Rejection Rule:           Reject H₀ if        Reject H₀ if        Reject H₀ if
p-Value Approach          p-value ≤ α         p-value ≤ α         p-value ≤ α
Rejection Rule:           Reject H₀ if        Reject H₀ if        Reject H₀ if
Critical Value Approach   z ≤ −z_α            z ≥ z_α             z ≤ −z_{α/2} or z ≥ z_{α/2}


NOTE + COMMENT


When testing a hypothesis of the population proportion with a sample size that is at least 5%
of the population size (that is, n/N ≥ .05), the finite population correction factor should be
used when calculating the standard error of the sampling distribution of p̄, i.e.,

$$\sigma_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{p_0(1 - p_0)}{n}}$$
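To see how much the finite population correction factor can matter for a proportion, here is a small Python sketch. The population size N = 4,000 is purely hypothetical (the text does not give a population size for Pine Creek), as is the function name; the other numbers reuse p₀ = .20, n = 400, and p̄ = .25.

```python
from math import sqrt


def fpc_standard_error_proportion(p0, n, N):
    """Standard error of the sample proportion under the null hypothesis,
    multiplied by the finite population correction factor."""
    return sqrt((N - n) / (N - 1)) * sqrt(p0 * (1 - p0) / n)


p0, n, p_bar = 0.20, 400, 0.25
N = 4000   # hypothetical population size, so n/N = .10 >= .05

se_plain = sqrt(p0 * (1 - p0) / n)                 # standard error without the correction
se_fpc = fpc_standard_error_proportion(p0, n, N)   # standard error with the correction

z_plain = (p_bar - p0) / se_plain
z_fpc = (p_bar - p0) / se_fpc
print(round(z_plain, 2), round(z_fpc, 2))
```

Under this assumed population size the corrected standard error is about 5% smaller, so the test statistic rises from 2.50 to roughly 2.63.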


EXERCISES 


Methods 

35. Consider the following hypothesis test:

H₀: p = .20
Hₐ: p ≠ .20

A sample of 400 provided a sample proportion p̄ = .175.

a. Compute the value of the test statistic.

b. What is the p-value?

c. At α = .05, what is your conclusion?

d. What is the rejection rule using the critical value? What is your conclusion?

36. Consider the following hypothesis test:

H₀: p ≥ .75
Hₐ: p < .75

A sample of 300 items was selected. Compute the p-value and state your conclusion
for each of the following sample results. Use α = .05.

a. p̄ = .68
b. p̄ = .72
c. p̄ = .70
d. p̄ = .77


Applications


37. Union Membership. The U.S. Bureau of Labor Statistics reports that 10.2% of 
U.S. workers belonged to unions in 2018. Suppose a sample of 400 U.S. workers is 
collected in 2019 to determine whether union efforts to organize have increased union 
membership. 

a. Formulate the hypotheses that can be used to determine whether union membership 
increased in 2019. 

b. If the sample results show that 48 of the workers belonged to unions, what is the 
p-value for your hypothesis test? 

c. At α = .05, what is your conclusion?

38. Attitudes toward Supermarket Brands. A study by Consumer Reports showed that 
64% of supermarket shoppers believe supermarket brands to be as good as national 
name brands. To investigate whether this result applies to its own product, the manu- 
facturer of a national name-brand ketchup asked a sample of shoppers whether they 
believed that supermarket ketchup was as good as the national brand ketchup. 

a. Formulate the hypotheses that could be used to determine whether the percentage 
of supermarket shoppers who believe that the supermarket ketchup was as good as 
the national brand ketchup differed from 64%. 
b. If a sample of 100 shoppers showed 52 stating that the supermarket brand was as 
good as the national brand, what is the p-value? 
c. At α = .05, what is your conclusion?
d. Should the national brand ketchup manufacturer be pleased with this conclusion? 
Explain. 


39. Population Mobility. What percentage of the population live in their state of birth?
According to the U.S. Census Bureau’s American Community Survey, the figure
ranges from 25% in Nevada to 78.7% in Louisiana. The average percentage across
all states and the District of Columbia is 57.7%. The data in the file HomeState are
consistent with the findings in the American Community Survey. The data represent
a random sample of 120 Arkansas residents and a random sample of 180 Virginia
residents.

a. Formulate hypotheses that can be used to determine whether the percentage of stay- 
at-home residents in the two states differs from the overall average of 57.7%. 

b. Estimate the proportion of stay-at-home residents in Arkansas. Does this proportion
differ significantly from the mean proportion for all states? Use α = .05.

c. Estimate the proportion of stay-at-home residents in Virginia. Does this proportion
differ significantly from the mean proportion for all states? Use α = .05.

d. Would you expect the proportion of stay-at-home residents to be higher in Virginia 
than in Arkansas? Support your conclusion with the results obtained in parts (b) 
and (c). 

40. Holiday Gifts from Employers. Last year, 46% of business owners gave a holiday 
gift to their employees. A survey of business owners conducted this year indicated that 
35% plan to provide a holiday gift to their employees. Suppose the survey results are 
based on a sample of 60 business owners. 

a. How many business owners in the survey plan to provide a holiday gift to their 
employees this year? 

b. Suppose the business owners in the sample do as they plan. Compute the p-value 
for a hypothesis test that can be used to determine whether the proportion of busi- 
ness owners providing holiday gifts has decreased from last year. 

c. Using a .05 level of significance, would you conclude that the proportion of business
owners providing gifts has decreased? What is the smallest level of significance
for which you could draw such a conclusion?

41. Adequate Preparation for Retirement. In 2018, RAND Corporation researchers 
found that 71% of all individuals ages 66 to 69 are adequately prepared financially for 
retirement. Many financial planners have expressed concern that a smaller percentage 
of those in this age group who did not complete high school are adequately prepared 
financially for retirement. 

a. Develop appropriate hypotheses such that rejection of H₀ will support the conclusion
that the proportion of those who are adequately prepared financially for
retirement is smaller for people in the 66–69 age group who did not complete high
school than it is for the population of the 66–69 age group.

b. In a random sample of 300 people from the 66–69 age group who did not complete
high school, 165 were not prepared financially for retirement. What is the p-value
for your hypothesis test?

c. At α = .01, what is your conclusion?

42. Returned Merchandise. According to the University of Nevada Center for Logistics 
Management, 6% of all merchandise sold in the United States gets returned. A Hous- 
ton department store sampled 80 items sold in January and found that 12 of the items 
were returned. 

a. Construct a point estimate of the proportion of items returned for the population of 
sales transactions at the Houston store. 

b. Construct a 95% confidence interval for the proportion of returns at the Houston 
store. 

c. Is the proportion of returns at the Houston store significantly different from the 
returns for the nation as a whole? Provide statistical support for your answer. 

43. Coupon Usage. Eagle Outfitters is a chain of stores specializing in outdoor apparel
and camping gear. It is considering a promotion that involves mailing discount coupons
to all its credit card customers. This promotion will be considered a success if
more than 10% of those receiving the coupons use them. Before going national with
the promotion, coupons were sent to a sample of 100 credit card customers.

a. Develop hypotheses that can be used to test whether the population proportion of 
those who will use the coupons is sufficient to go national. 

b. The file Eagle contains the sample data. Develop a point estimate of the population 
proportion. 

c. Use α = .05 to conduct your hypothesis test. Should Eagle go national with the
promotion? 

44. Malpractice Suits. One of the reasons health care costs have been rising rapidly in
recent years is the increasing cost of malpractice insurance for physicians. Also, fear
of being sued causes doctors to run more precautionary tests (possibly unnecessary)
just to make sure they are not guilty of missing something. These precautionary tests
also add to health care costs. Data in the file LawSuit are consistent with findings in a
Reader’s Digest article and can be used to estimate the proportion of physicians over
the age of 55 who have been sued at least once.

a. Formulate hypotheses that can be used to see if these data can support a finding that 
more than half of physicians over the age of 55 have been sued at least once. 

b. Use Excel and the file LawSuit to compute the sample proportion of physicians 
over the age of 55 who have been sued at least once. What is the p-value for your 
hypothesis test? 

c. At α = .01, what is your conclusion?

45. Bullish, Neutral, or Bearish. The American Association of Individual Investors
(AAII) conducts a weekly survey of its members to measure the percent who are
bullish, bearish, and neutral on the stock market for the next six months. For the week
ending March 27, 2019, the survey results showed 33.2% bullish, 39.6% neutral, and
27.2% bearish. Assume these results are based on a sample of 300 AAII members.

a. Over the long term, the proportion of bullish AAII members is .385. Conduct a hypothesis
test at the 5% level of significance to see if the current sample results show that
bullish sentiment differs from its long-term average of .385. What are your findings?

b. Over the long term, the proportion of bearish AAII members is .31. Conduct a 
hypothesis test at the 1% level of significance to see if the current sample re- 
sults show that bearish sentiment is above its long-term average of .31. What are 
your findings? 

c. Would you feel comfortable extending these results to all investors? Why or why not? 




9.6 Practical Advice: Big Data and Hypothesis Testing 


We have seen that interval estimates of the population mean μ and the population proportion
p narrow as the sample size increases. This occurs because the standard error of the
associated sampling distributions decreases as the sample size increases. Now consider the
relationship between interval estimation and hypothesis testing that we discussed earlier
in this chapter. If we construct a 100(1 − α)% interval estimate for the population mean,
we reject H₀: μ = μ₀ if the 100(1 − α)% interval estimate does not contain μ₀. Thus, for
a given level of confidence, as the sample size increases we will reject H₀: μ = μ₀ for
increasingly smaller differences between the sample mean x̄ and the hypothesized population
mean μ₀. We can see that when the sample size n is very large, almost any difference
between the sample mean x̄ and the hypothesized population mean μ₀ results in rejection
of the null hypothesis.
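The shrinking threshold described above can be made concrete with a short Python sketch. Several assumptions are made for illustration only: a two-tailed test at α = .05, a sample standard deviation of s = 20, and the normal approximation in place of the t distribution (reasonable for large n). The function name is hypothetical. The function returns the margin of error of the interval estimate, which is the smallest difference between x̄ and μ₀ that leads to rejection.

```python
from math import sqrt
from statistics import NormalDist


def smallest_rejected_difference(s, n, alpha=0.05):
    """Margin of error z_{alpha/2} * s / sqrt(n): any |x_bar - mu_0| at least
    this large puts mu_0 outside the interval estimate (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * s / sqrt(n)


s = 20  # assumed sample standard deviation
for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,}  {smallest_rejected_difference(s, n):.4f}")
```

Under these assumptions the threshold falls from about 3.92 at n = 100 to about 0.39 at n = 10,000 and about 0.04 at n = 1,000,000: each hundredfold increase in the sample size cuts it by a factor of ten.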


Big Data, Hypothesis Testing, and p-Values 


In this section, we will elaborate on how big data affects hypothesis testing and the magnitude
of p-values. Specifically, we will examine how rapidly the p-value associated with a given
difference between a point estimate and a hypothesized value of a parameter decreases as
the sample size increases.

TABLE 9.5 Values of the Test Statistic t and the p-Values for the Test of the
Null Hypothesis H₀: μ ≤ 84 and Sample Mean x̄ = 84.1 Seconds
for Various Sample Sizes

Sample Size n            t          p-Value
10                     .01581      .49386
100                    .05000      .48011
1,000                  .15811      .43720
10,000                 .50000      .30854
100,000               1.58114      .05692
1,000,000             5.00000      2.87E-07
10,000,000           15.81139      1.30E-56
100,000,000          50.00000      .00E+00
1,000,000,000       158.11388      .00E+00

Let us consider the online news service PenningtonDailyTimes.com (PDT). PDT’s primary
source of revenue is the sale of advertising, and prospective advertisers are willing to pay a
premium to advertise on websites that have long visit times. To promote its news service, PDT’s
management wants to promise potential advertisers that the mean time spent by customers
when they visit PenningtonDailyTimes.com is greater than last year, that is, more than 84
seconds. PDT therefore decides to collect a sample tracking the amount of time spent by
individual customers when they visit PDT’s website in order to test its null hypothesis H₀: μ ≤ 84.

For a sample mean of 84.1 seconds and a sample standard deviation of s = 20 seconds,
Table 9.5 provides the values of the test statistic t and the p-values for the test of the null
hypothesis H₀: μ ≤ 84. The p-value for this hypothesis test is essentially 0 for all samples
in Table 9.5 with at least n = 1,000,000.
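The t column of Table 9.5 follows directly from the test statistic for the σ unknown case, t = (x̄ − μ₀)/(s/√n). A minimal Python sketch (the function name is illustrative) reproduces it. The p-values are not computed here because they require the t distribution with n − 1 degrees of freedom, which the Python standard library does not provide.

```python
from math import sqrt


def t_statistic(x_bar, mu0, s, n):
    """Test statistic for a hypothesis test about a population mean, sigma unknown."""
    return (x_bar - mu0) / (s / sqrt(n))


x_bar, mu0, s = 84.1, 84, 20   # values given for the PDT example

# Sample sizes 10, 100, ..., 1,000,000,000 as in Table 9.5.
for k in range(1, 10):
    n = 10 ** k
    print(f"{n:>13,}  {t_statistic(x_bar, mu0, s, n):10.5f}")
```

With x̄, μ₀, and s held fixed, t grows in proportion to √n, which is why each tenfold increase in the sample size multiplies t by about 3.16.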

PDT’s management also wants to promise potential advertisers that the proportion of its 
website visitors who click on an ad this year exceeds the proportion of its website visitors 
who clicked on an ad last year, which was .50. PDT collects information from its sample 
on whether the visitor to its website clicked on any of the ads featured on the website, and 
it wants to use these data to test its null hypothesis H₀: p ≤ .50.

For a sample proportion of .51, Table 9.6 provides the values of the test statistic z and
the p-values for the test of the null hypothesis H₀: p ≤ .50. The p-value for this hypothesis test is
essentially 0 for all samples in Table 9.6 with at least n = 100,000.

We see in Tables 9.5 and 9.6 that the p-value associated with a given difference between a
point estimate and a hypothesized value of a parameter decreases as the sample size increases.
As a result, if the sample mean time spent by customers when they visit PDT’s website is
84.1 seconds, PDT’s null hypothesis H₀: μ ≤ 84 is not rejected at α = .01 for samples with
n = 100,000, and is rejected at α = .01 for samples with n = 1,000,000. Similarly, if the sample
proportion of visitors to its website who clicked on an ad featured on the website is .51, PDT’s
null hypothesis H₀: p ≤ .50 is not rejected at α = .01 for samples with n = 10,000, and is
rejected at α = .01 for samples with n = 100,000. In both instances, as the sample size becomes
extremely large the p-value associated with the given difference between a point estimate and
the hypothesized value of the parameter becomes extremely small.


TABLE 9.6 Values of the Test Statistic z and the p-Values for the Test of the
Null Hypothesis H₀: p ≤ .50 and Sample Proportion p̄ = .51 for
Various Sample Sizes

Sample Size n            z          p-Value

10                     .06325       .47479
100                    .20000       .42074
1,000                  .63246       .26354
10,000                2.00000       .02275
100,000               6.32456       1.27E-10
1,000,000            20.00000       .00E+00
10,000,000           63.24555       .00E+00
100,000,000         200.00000       .00E+00
1,000,000,000       632.45553       .00E+00
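The entries in Table 9.6 can be reproduced with a few lines of code. The sketch below is only an illustration written in Python rather than Excel; the function name proportion_upper_tail_test is an assumption introduced here, and the upper tail area of the standard normal distribution is computed with the standard library function math.erfc.

```python
from math import erfc, sqrt


def proportion_upper_tail_test(p_bar, p0, n):
    """Return (z, p_value) for the upper tail test H0: p <= p0."""
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_bar - p0) / standard_error
    # Upper tail area of the standard normal distribution: P(Z >= z)
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value


# Sample sizes 10, 100, ..., 1,000,000,000 as in Table 9.6
sample_sizes = [10 ** k for k in range(1, 10)]
table_9_6 = {n: proportion_upper_tail_test(0.51, 0.50, n) for n in sample_sizes}

for n, (z, p_value) in table_9_6.items():
    print(f"{n:>13,d}  z = {z:10.5f}  p-value = {p_value:.5g}")
```

The difference between the sample proportion and the hypothesized value is held at .01 in every row; only n changes, which is why z grows in proportion to the square root of n and the p-value shrinks toward 0.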


Implications of Big Data in Hypothesis Testing 


Suppose PDT collects a sample of 1,000,000 visitors to its website and uses these data to test
its null hypotheses H₀: μ ≤ 84 and H₀: p ≤ .50 at the .05 level of significance. The sample
mean is 84.1 and the sample proportion is .51, so the null hypothesis is rejected in both tests as
Tables 9.5 and 9.6 show. As a result, PDT can promise potential advertisers that the mean time
spent by individual customers who visit PDT’s website exceeds 84 seconds and the proportion
of individual visitors to its website who click on an ad exceeds .50. These results suggest that for
each of these hypothesis tests, the difference between the point estimate and the hypothesized 
value of the parameter being tested is not likely solely a consequence of sampling error. How- 
ever, the results of any hypothesis test, no matter the sample size, are only reliable if the sample 
is relatively free of nonsampling error. If nonsampling error is introduced in the data collection 
process, the likelihood of making a Type I or Type II error may be higher than if the sample data 
are free of nonsampling error. Therefore, when testing a hypothesis, it is always important to 
think carefully about whether a random sample of the population of interest has been taken. 

If PDT determines that it has introduced little or no nonsampling error into its sample 
data, the only remaining plausible explanation for these results is that these null hypotheses 
are false. At this point, PDT and the companies that advertise on PenningtonDailyTimes.com
should also consider whether these statistically significant differences between the
point estimates and the hypothesized values of the parameters being tested are of practical 
significance. Although a .1 second increase in the mean time spent by customers when 
they visit PDT’s website is statistically significant, it may not be meaningful to companies 
that might advertise on PenningtonDailyTimes.com. Similarly, although an increase of .01 
in the proportion of visitors to its website that click on an ad is statistically significant, it 
may not be meaningful to companies that might advertise on PenningtonDailyTimes.com. 
PDT and its advertisers must determine whether these statistically significant differences
have meaningful implications for their ensuing business decisions.

Ultimately, no business decision should be based solely on statistical inference. Practical 
significance should always be considered in conjunction with statistical significance. This 
is particularly important when the hypothesis test is based on an extremely large sample be- 
cause even an extremely small difference between the point estimate and the hypothesized 
value of the parameter being tested will be statistically significant. When done properly, 
statistical inference provides evidence that should be considered in combination with infor- 
mation collected from other sources to make the most informed decision possible. 
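One way to see why practical significance matters more as n grows is to compute the minimum significant difference for several sample sizes. The following Python sketch is an illustration only: it assumes an upper tail test about a mean, uses the PDT sample standard deviation s = 20, and approximates the t distribution with the standard normal distribution, which is reasonable only for large n. The function name min_significant_difference is introduced here for the example.

```python
from math import sqrt
from statistics import NormalDist


def min_significant_difference(s, n, alpha):
    """Smallest value of (x_bar - mu_0) that leads to rejection in an upper
    tail test, using a normal approximation (assumes n is large)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z_alpha * s / sqrt(n)


for n in (100, 10_000, 1_000_000, 100_000_000):
    d = min_significant_difference(s=20, n=n, alpha=0.05)
    print(f"n = {n:>11,d}: a difference of about {d:.4f} seconds is significant")
```

With n = 1,000,000 the threshold is roughly .03 seconds, so the observed difference of .1 second is statistically significant even though an advertiser may regard it as negligible.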


NOTES + COMMENTS 


1. Nonsampling error can occur when either a probability sampling technique or a
nonprobability sampling technique is used. However, nonprobability sampling techniques
such as convenience sampling and judgment sampling often introduce nonsampling error
into sample data because of the manner in which sample data are collected. Therefore,
probability sampling techniques are preferred over nonprobability sampling techniques.

2. When taking an extremely large sample, it is conceivable that the sample size is at
least 5% of the population size; that is, n/N ≥ .05. Under these conditions, it is
necessary to use the finite population correction factor when calculating the standard
error of the sampling distribution to be used in confidence intervals and hypothesis
testing.
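The adjustment described in note 2 can be sketched in Python as follows. The function name and the population size used in the example are hypothetical; the correction factor itself is the usual one, the square root of (N − n)/(N − 1), applied to the standard error of the mean when n/N ≥ .05.

```python
from math import sqrt


def standard_error_mean(s, n, population_size=None):
    """Estimated standard error of the sample mean.

    The finite population correction factor sqrt((N - n) / (N - 1)) is applied
    only when the population size N is given and n / N >= .05.
    """
    se = s / sqrt(n)
    if population_size is not None and n / population_size >= 0.05:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se


# Hypothetical illustration: s = 20, n = 1,000,000, population of 10,000,000
print(standard_error_mean(20, 1_000_000))              # no correction
print(standard_error_mean(20, 1_000_000, 10_000_000))  # n/N = .10, corrected
```

Because the correction factor is less than 1, ignoring it when n/N ≥ .05 overstates the standard error and understates the test statistic.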
Applications

46. Government Use of email. The Federal Government wants to determine whether the
mean number of business emails sent and received per business day by its employees
differs from the mean number of emails sent and received per day by corporate
employees, which is 101.5. Suppose the department electronically collects information
on the number of business emails sent and received on a randomly selected business
day over the past year from each of 10,163 randomly selected federal employees. The
results are provided in the file FedEmail. Test the department’s hypothesis at α = .01.
Discuss the practical significance of the results.

47. CEOs and Social Networks. CEOs who belong to a popular business-oriented social
networking service have an average of 930 connections. Do other members have
fewer connections than CEOs? The number of connections for a random sample of
7515 members who are not CEOs is provided in the file SocialNetwork. Using this
sample, test the hypothesis that other members have fewer connections than CEOs at
α = .01. Discuss the practical significance of the results.

48. French Fry Purchases. The American Potato Growers Association (APGA) would 
like to test the claim that the proportion of fast food orders this year that includes 
French fries exceeds the proportion of fast food orders that included French fries last 
year. Suppose that a random sample of 49,581 electronic receipts for fast food orders 
placed this year shows that 31,038 included French fries. Assuming that the proportion 
of fast food orders that included French fries last year is .62, use this information to 
test APGA’s claim at α = .05. Discuss the practical significance of the results.

49. GPS Usage in Canada. According to CNN, 55% of all U.S. smartphone users have 
used their GPS capability to get directions. Suppose a provider of wireless telephone 
service in Canada wants to know if GPS usage by its customers differs from U.S. 
smartphone users. The company collects usage records for this year for a random 
sample of 547,192 of its customers and determines that 302,050 of these customers 
have used their telephone’s GPS capability this year. Use this information to test the 
Canadian company’s claim at α = .01. Discuss the practical significance of the results.




SUMMARY


Hypothesis testing is a statistical procedure that uses sample data to determine whether a 
statement about the value of a population parameter should or should not be rejected. The 
hypotheses are two competing statements about a population parameter. One statement is 
called the null hypothesis (H₀), and the other statement is called the alternative hypothesis
(Hₐ). In Section 9.1 we provided guidelines for developing hypotheses for situations
frequently encountered in practice. 

Whenever historical data or other information provide a basis for assuming that the 
population standard deviation is known, the hypothesis testing procedure for the population 
mean is based on the standard normal distribution. Whenever σ is unknown, the sample
standard deviation s is used to estimate σ and the hypothesis testing procedure is based on
the t distribution. In both cases, the quality of results depends on both the form of the
population distribution and the sample size. If the population has a normal distribution, both
hypothesis testing procedures are applicable, even with small sample sizes. If the popula- 
tion is not normally distributed, larger sample sizes are needed. General guidelines about 
the sample size were provided in Sections 9.3 and 9.4. In the case of hypothesis tests about 
a population proportion, the hypothesis testing procedure uses a test statistic based on the 
standard normal distribution. 

In all cases, the value of the test statistic can be used to compute a p-value for the test. A 
p-value is a probability used to determine whether the null hypothesis should be rejected. If the 
p-value is less than or equal to the level of significance α, the null hypothesis can be rejected.

Hypothesis testing conclusions can also be made by comparing the value of the test
statistic to a critical value. For lower tail tests, the null hypothesis is rejected if the value 
of the test statistic is less than or equal to the critical value. For upper tail tests, the null 
hypothesis is rejected if the value of the test statistic is greater than or equal to the critical 
value. Two-tailed tests consist of two critical values: one in the lower tail of the sampling 
distribution and one in the upper tail. In this case, the null hypothesis is rejected if the 
value of the test statistic is less than or equal to the critical value in the lower tail or greater 
than or equal to the critical value in the upper tail. Finally, we discussed the ramifications 
of extremely large samples on hypothesis tests of the mean and proportion. 
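The two decision rules summarized above, the p-value approach and the critical value approach, always lead to the same conclusion. The Python sketch below illustrates this for a test statistic z that follows the standard normal distribution; the function names and the tail labels "lower", "upper", and "two" are conventions assumed for this example only.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def p_value(z, tail):
    """p-value for a z test statistic; tail is 'lower', 'upper', or 'two'."""
    if tail == "lower":
        return STANDARD_NORMAL.cdf(z)
    if tail == "upper":
        return 1 - STANDARD_NORMAL.cdf(z)
    if tail == "two":
        return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper', or 'two'")


def reject_by_p_value(z, alpha, tail):
    """Reject H0 if the p-value is less than or equal to alpha."""
    return p_value(z, tail) <= alpha


def reject_by_critical_value(z, alpha, tail):
    """Reject H0 by comparing z with the critical value(s)."""
    if tail == "lower":
        return z <= STANDARD_NORMAL.inv_cdf(alpha)
    if tail == "upper":
        return z >= STANDARD_NORMAL.inv_cdf(1 - alpha)
    if tail == "two":
        return abs(z) >= STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'lower', 'upper', or 'two'")


# z = 2.00 is the Table 9.6 test statistic for n = 10,000
for alpha in (0.05, 0.01):
    print(alpha,
          reject_by_p_value(2.0, alpha, "upper"),
          reject_by_critical_value(2.0, alpha, "upper"))
```

For z = 2.00 both functions reject H₀ at α = .05 and fail to reject it at α = .01, in agreement with the p-value of .02275 reported in Table 9.6.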


GLOSSARY


Alternative hypothesis The hypothesis concluded to be true if the null hypothesis is rejected. 
Critical value A value that is compared with the test statistic to determine whether H₀
should be rejected. 

Level of significance The probability of making a Type I error when the null hypothesis is 
true as an equality. 

Minimum significant difference The smallest difference between a point estimate and a 
hypothesized value of a parameter that will result in rejection of the null hypothesis for a 
given level of significance a. 

Null hypothesis The hypothesis tentatively assumed true in the hypothesis testing procedure. 
One-tailed test A hypothesis test in which rejection of the null hypothesis occurs for 
values of the test statistic in one tail of its sampling distribution. 

Practical significance The usefulness or meaningfulness to a decision maker of the differ- 
ence between a point estimate and the hypothesized value of a parameter. 

p-value A probability that provides a measure of the evidence against the null hypothesis 
provided by the sample. Smaller p-values indicate more evidence against H₀. For a lower
tail test, the p-value is the probability of obtaining a value for the test statistic as small as or 
smaller than that provided by the sample. For an upper tail test, the p-value is the probabil- 
ity of obtaining a value for the test statistic as large as or larger than that provided by the 
sample. For a two-tailed test, the p-value is the probability of obtaining a value for the test 
statistic at least as unlikely as or more unlikely than that provided by the sample. 

Test statistic A statistic whose value helps determine whether a null hypothesis should
be rejected.

Two-tailed test A hypothesis test in which rejection of the null hypothesis occurs for 
values of the test statistic in either tail of its sampling distribution. 

Type I error The error of rejecting H₀ when it is true.

Type II error The error of accepting H₀ when it is false.


KEY FORMULAS


Test Statistic for Hypothesis Tests About a Population Mean: σ Known

$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad (9.1) $$

Test Statistic for Hypothesis Tests About a Population Mean: σ Unknown

$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (9.2) $$

Test Statistic for Hypothesis Tests About a Population Proportion

$$ z = \frac{\bar{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \qquad (9.4) $$
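For readers who want to check hand or Excel calculations, formulas (9.1), (9.2), and (9.4) translate directly into code. The Python functions below are a minimal sketch; their names are chosen here for illustration, and the example values are the PDT figures used earlier in the chapter.

```python
from math import sqrt


def z_statistic_mean(x_bar, mu_0, sigma, n):
    """Formula (9.1): test statistic for a population mean, sigma known."""
    return (x_bar - mu_0) / (sigma / sqrt(n))


def t_statistic_mean(x_bar, mu_0, s, n):
    """Formula (9.2): test statistic for a population mean, sigma unknown.
    The statistic has n - 1 degrees of freedom."""
    return (x_bar - mu_0) / (s / sqrt(n))


def z_statistic_proportion(p_bar, p_0, n):
    """Formula (9.4): test statistic for a population proportion."""
    return (p_bar - p_0) / sqrt(p_0 * (1 - p_0) / n)


# PDT example: x_bar = 84.1, mu_0 = 84, s = 20, n = 1,000,000
print(t_statistic_mean(84.1, 84, 20, 1_000_000))
# PDT example: p_bar = .51, p_0 = .50, n = 10,000
print(z_statistic_proportion(0.51, 0.50, 10_000))
```

Formulas (9.1) and (9.2) have the same form; they differ in whether σ or its estimate s appears in the denominator and therefore in which distribution, standard normal or t, is used to find the p-value.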
SUPPLEMENTARY EXERCISES


50. Production Line Fill Weights. A production line operates with a mean filling weight 
of 16 ounces per container. Overfilling or underfilling presents a serious problem and 
when detected requires the operator to shut down the production line to readjust the 
filling mechanism. From past data, a population standard deviation σ = .8 ounces is
assumed. A quality control inspector selects a sample of 30 items every hour and at
that time makes the decision of whether to shut down the line for readjustment. The
level of significance is α = .05.

a. State the hypothesis test for this quality control application.

b. If a sample mean of x̄ = 16.32 ounces were found, what is the p-value? What
action would you recommend?

c. If a sample mean of x̄ = 15.82 ounces were found, what is the p-value? What
action would you recommend?

d. Use the critical value approach. What is the rejection rule for the preceding
hypothesis testing procedure? Repeat parts (b) and (c). Do you reach the same conclusion?

51. Scholarship Examination Scores. At Western University the historical mean of 
scholarship examination scores for freshman applications is 900. A historical population
standard deviation σ = 180 is assumed known. Each year, the assistant dean uses
a sample of applications to determine whether the mean examination score for the new 
freshman applications has changed. 

a. State the hypotheses. 

b. What is the 95% confidence interval estimate of the population mean examination 
score if a sample of 200 applications provided a sample mean of x̄ = 935?

c. Use the confidence interval to conduct a hypothesis test. Using α = .05, what is
your conclusion? 

d. What is the p-value? 

52. Exposure to Background Television. CNN reports that young children in the 
United States are exposed to an average of 4 hours of background television per day. 
Having the television on in the background while children are doing other activities 
may have adverse consequences on a child’s well-being. You have a research hypoth- 
esis that children from low-income families are exposed to more than 4 hours of daily 
background television. In order to test this hypothesis, you have collected a random 
sample of 60 children from low-income families and found that these children were 
exposed to a sample mean of 4.5 hours of daily background television. 

a. Develop hypotheses that can be used to test your research hypothesis. 

b. Based on a previous study, you are willing to assume that the population standard 
deviation is σ = 1.5 hours. What is the p-value based on your sample of 60 children
from low-income families?

c. Use α = .01 as the level of significance. What is your conclusion?

53. Starting Salaries for Business Graduates. Michigan State University’s Collegiate 
Employment Research Institute found that starting salary for recipients of bachelor’s 
degrees in business was $50,032 in 2017. The results for a sample of 100 business 
majors receiving a bachelor’s degree in 2018 showed a mean starting salary of $51,276 
with a sample standard deviation of $5200. Conduct a hypothesis test to determine 
whether the mean starting salary for business majors in 2018 is greater than the mean 
starting salary in 2017. Use α = .01 as the level of significance.

54. British Men’s Age at Marriage. Data from the Office for National Statistics show
that the mean age at which men in Great Britain get married was 30.8 years in 2013.
A news reporter noted that this represents a continuation of the trend of waiting until
a later age to wed. A new sample of 47 recently wed British men provided their age at
the time of marriage. These data are contained in the file BritainMarriages. Do these
data indicate that the mean age of British men at the time of marriage exceeds the
mean age in 2013? Test this hypothesis at α = .05. What is your conclusion?


55. Wages of Workers Without High School Diploma. SmartAsset reports that the
average weekly earnings for workers who have not received a high school diploma is
$493 in 2018. Suppose you would like to determine if the average weekly earnings for
workers who have received a high school diploma is significantly greater than average
weekly earnings for workers who have not received a high school diploma. Data
providing the weekly pay for a sample of 50 workers who have received a high school
diploma are available in the file WeeklyHSGradPay. These data are consistent with the
findings reported by SmartAsset.

a. State the hypotheses that should be used to test whether the mean weekly pay for 
workers who have received a high school diploma is significantly greater than the 
mean weekly pay for workers who have not received a high school diploma. 

b. Use the data in the file WeeklyHSGradPay to compute the sample mean, the test 
statistic, and the p-value. 

c. Use α = .05. What is your conclusion? Is this result surprising? Why did these data
likely lead to this conclusion? 

56. Residential Property Values. The chamber of commerce of a Florida Gulf Coast 
community advertises that area residential property is available at a mean cost of 
$125,000 or less per lot. Suppose a sample of 32 properties provided a sample mean 
of $130,000 per lot and a sample standard deviation of $12,500. Use a .05 level of 
significance to test the validity of the advertising claim. 

57. Length of Time to Sell a Home. According to the National Association of Realtors, it 
took an average of three weeks to sell a home in 2017. Data for the sale of 40 ran- 
domly selected homes sold in Greene County, Ohio, in 2017 showed a sample mean of 
3.6 weeks with a sample standard deviation of 2 weeks. Conduct a hypothesis test to 
determine whether the number of weeks until a house sold in Greene County differed 
from the national average in 2017. Use α = .05 for the level of significance, and state
your conclusion. 

58. Sleeping on Flights. According to Expedia, 52% of Americans report that they gen- 
erally can sleep during flights. Are people who fly frequently more likely to be able to 
sleep during flights? Suppose we have a random sample of 510 individuals who flew at 
least 25,000 miles last year and 285 indicated that they were able to sleep during flights. 
a. Conduct a hypothesis test to determine whether the results justify concluding that
people who fly frequently are more likely to be able to sleep during flights. Use
α = .05.

b. Conduct the same hypothesis test you performed in part (a) at α = .01. What is
your conclusion? 

59. Using Laptops on Flights. An airline promotion to business travelers is based on the 
assumption that two-thirds of business travelers use a laptop computer on overnight 
business trips. 

a. State the hypotheses that can be used to test the assumption. 

b. What is the sample proportion from an American Express sponsored survey that found 
355 of 546 business travelers use a laptop computer on overnight business trips? 

c. What is the p-value? 

d. Use α = .05. What is your conclusion?

60. Millennial Dependency on Parents. Members of the millennial generation are 
continuing to be dependent on their parents (either living with or otherwise receiving 
support from parents) into early adulthood. A family research organization has claimed 
that, in past generations, no more than 30% of individuals aged 18 to 32 continued to 
be dependent on their parents. Suppose that a sample of 400 individuals aged 18 to 
32 showed that 136 of them continue to be dependent on their parents. 

a. Develop hypotheses for a test to determine whether the proportion of millennials 
continuing to be dependent on their parents is higher than for past generations. 

b. What is your point estimate of the proportion of millennials that are continuing to 
be dependent on their parents? 


c. What is the p-value provided by the sample data?

d. What is your hypothesis testing conclusion? Use α = .05 as the level of significance.

61. Using Social Media in a Job Search. According to Inc.com, 79% of job seekers used 
social media in their job search in 2018. Many believe this number is inflated by the 
proportion of 22- to 30-year-old job seekers who use social media in their job search. 
A survey of 22- to 30-year-old job seekers showed that 310 of the 370 respondents use 
social media in their job search. In addition, 275 of the 370 respondents indicated they 
have electronically submitted a resume to an employer. 

a. Conduct a hypothesis test to determine if the results of the survey justify conclud- 
ing the proportion of 22- to 30-year-old job seekers who use social media in their 
job search exceeds the proportion of the population that use social media in their 
job search. Use α = .05.

b. Conduct a hypothesis test to determine if the results of the survey justify con- 
cluding that more than 70% of 22- to 30-year-old job seekers have electronically 
submitted a resume to an employer. Using α = .05, what is your conclusion?

62. Hotel Availability over Holiday Weekend. A radio station in Myrtle Beach announced 
that at least 90% of the hotels and motels would be full for the Memorial Day week- 
end. The station advised listeners to make reservations in advance if they planned to 
be in the resort over the weekend. On Saturday night a sample of 58 hotels and motels 
showed 49 with a no-vacancy sign and 9 with vacancies. What is your reaction to the 
radio station’s claim after seeing the sample evidence? Use α = .05 in making the
statistical test. What is the p-value? 

63. Vegetarianism in the United States. Vegetarians are much less common in the United 
States than in the rest of the world. In a 2018 survey of 11,000 people in the United 
States, VeganBits found 55 who are vegetarians. 

a. Develop a point estimate of the proportion of people in the United States who are 
vegetarians. 

b. Set up a hypothesis test so that the rejection of H₀ will allow you to conclude that
the proportion of people in the United States who are vegetarians exceeds .004.

c. Conduct your hypothesis test using α = .05. What is your conclusion?

64. Time Spent Channel Surfing. According to Ericsson’s 2016 ConsumerLab TV &
Media report, the average person in the United States spends 23 minutes per day channel
surfing. The file ChannelSurfing provides the number of minutes per day looking
for something to watch on television for a random sample of 8783 people in December.
Do these data support the conclusion that people spend less time channel surfing
during December than they do throughout the year? Test this hypothesis at α = .01.
Discuss the practical significance of the results.

65. Potato Chip Quality Control. NDC Technology’s MM710e On-Line Snacks Gauge 
rapidly measures surface brownness of potato chips just before packaging. This allows 
for a high degree of control over this important characteristic of a potato chip; chips 
that are too brown are overfried, and chips that are not sufficiently brown are under- 
fried. A potato chip manufacturer is now using the MM710e to assess the quality of the 
chips it produces; one of this manufacturer’s goals is to produce less than 1 overfried 
chip in every 1000 chips. In a recent random sample of 111,667 chips taken from the 
production lines of the manufacturer’s production facilities nationwide, the MM710e 
found 98 overfried chips. Conduct a hypothesis test to determine if the sample data 
indicate the manufacturer is meeting its goal for overfried chips at α = .05.

66. TSA Security Line Wait Times. According to the U.S. Transportation Security
Administration (TSA), 2% of the 771,556,886 travelers who utilized 440 federalized
airports in 2017 waited more than 20 minutes in the TSA security line. The file
TSAWaitTimes contains waiting times in TSA security lines at a major U.S. airport for
a recent random sample of 10,531 travelers. Use these data to test the hypothesis that
the proportion of travelers waiting more than 20 minutes in TSA security lines at this
airport is the same as the national proportion at α = .05.


CASE PROBLEM 1: QUALITY ASSOCIATES, INC.


Quality Associates, Inc., a consulting firm, advises its clients about sampling and statistical 
procedures that can be used to control their manufacturing processes. In one particular ap- 
plication, a client gave Quality Associates a sample of 800 observations taken during a time 
in which that client’s process was operating satisfactorily. The sample standard deviation 
for these data was .21; hence, with so much data, the population standard deviation was 
assumed to be .21. Quality Associates then suggested that random samples of size 30 be 
taken periodically to monitor the process on an ongoing basis. By analyzing the new sam- 
ples, the client could quickly learn whether the process was operating satisfactorily. When 
the process was not operating satisfactorily, corrective action could be taken to eliminate 
the problem. The design specification indicated the mean for the process should be 12. The 
hypothesis test suggested by Quality Associates follows. 


H₀: μ = 12
Hₐ: μ ≠ 12


Corrective action will be taken any time H₀ is rejected.

The samples shown below were collected at hourly intervals during the first
day of operation of the new statistical process control procedure. These data are available 
in the file Quality. 


Managerial Report 


1. Conduct a hypothesis test for each sample at the .01 level of significance and determine 
what action, if any, should be taken. Provide the test statistic and p-value for each test. 

2. Compute the standard deviation for each of the four samples. Does the assumption of 
.21 for the population standard deviation appear reasonable? 

3. Compute limits for the sample mean x̄ around μ = 12 such that, as long as a new sample
mean is within those limits, the process will be considered to be operating satisfactorily. If
x̄ exceeds the upper limit or if x̄ is below the lower limit, corrective action will be taken.
These limits are referred to as upper and lower control limits for quality control purposes. 

4. Discuss the implications of changing the level of significance to a larger value. What 
mistake or error could increase if the level of significance is increased? 


Sample 1    Sample 2    Sample 3    Sample 4

11.55       11.62       11.91       12.02
11.62       11.69       11.36       12.02
11.52       11.59       11.75       12.05
11.75       11.82       11.95       12.18
11.90       11.97       12.14       12.11
11.64       11.71       11.72       12.07
11.80       11.87       11.61       12.05
12.03       12.10       11.85       11.64
11.94       12.01       12.16       12.39
11.92       11.99       11.91       11.65
12.13       12.20       12.12       12.11
12.09       12.16       11.61       11.90
11.93       12.00       12.21       12.22
12.21       12.28       11.56       11.88
12.32       12.39       11.95       12.03
11.93       12.00       12.01       12.35
11.85       11.92       12.06       12.09
11.76       11.83       11.76       11.77
12.16       12.23       11.82       12.20
11.77       11.84       12.12       11.79
12.00       12.07       11.60       12.30
12.04       12.11       11.95       12.27
11.98       12.05       11.96       12.29
12.30       12.37       12.22       12.47
12.18       12.25       11.75       12.03
11.97       12.04       11.96       12.17
12.17       12.24       11.95       11.94
11.85       11.92       11.89       11.97
12.30       12.37       11.88       12.23
12.15       12.22       11.93       12.25


CASE PROBLEM 2: ETHICAL BEHAVIOR OF 
BUSINESS STUDENTS AT BAYVIEW UNIVERSITY 


During the global recession of 2008 and 2009, there were many accusations of unethical 
behavior by Wall Street executives, financial managers, and other corporate officers. At that 
time, an article appeared that suggested that part of the reason for such unethical business 
behavior may stem from the fact that cheating has become more prevalent among business 
students (Chronicle of Higher Education). The article reported that 56% of business stu- 
dents admitted to cheating at some time during their academic career as compared to 47% 
of nonbusiness students. 

Cheating has been a concern of the dean of the College of Business at Bayview Univer- 
sity for several years. Some faculty members in the college believe that cheating is more 
widespread at Bayview than at other universities, whereas other faculty members think that 
cheating is not a major problem in the college. To resolve some of these issues, the dean 
commissioned a study to assess the current ethical behavior of business students at Bayview. 
As part of this study, an anonymous exit survey was administered to a sample of 90 business 
students from this year’s graduating class. Responses to the following questions were used to 
obtain data regarding three types of cheating. 


During your time at Bayview, did you ever present work copied off the Internet as your own?
    Yes    No

During your time at Bayview, did you ever copy answers off another student’s exam?
    Yes    No

During your time at Bayview, did you ever collaborate with other students on projects
that were supposed to be completed individually?
    Yes    No

Any student who answered Yes to one or more of these questions was considered to have
been involved in some type of cheating. A portion of the data collected follows. The complete
data set is in the file Bayview.


           Copied from    Copied     Collaborated on
Student    Internet       on Exam    Individual Project    Gender

1          No             No         No                    Female
2          No             No         No                    Male
3          Yes            No         Yes                   Male
4          Yes            Yes        No                    Male
5          No             No         Yes                   Male
6          Yes            No         No                    Female
...
88         No             No         No                    Male
89         No             Yes        Yes                   Male
90         No             No         No                    Female


Managerial Report 

Prepare a report for the dean of the college that summarizes your assessment of the nature 
of cheating by business students at Bayview University. Be sure to include the following 
items in your report. 


1. Use descriptive statistics to summarize the data and comment on your findings. 

2. Develop 95% confidence intervals for the proportion of all students, the proportion of 
male students, and the proportion of female students who were involved in some type of 
cheating. 

3. Conduct a hypothesis test to determine whether the proportion of business students at 
Bayview University who were involved in some type of cheating is less than that of business 
students at other institutions as reported by the Chronicle of Higher Education. 

4. Conduct a hypothesis test to determine whether the proportion of business students at 
Bayview University who were involved in some form of cheating is less than that of non- 
business students at other institutions as reported by the Chronicle of Higher Education. 

5. What advice would you give to the dean based upon your analysis of the data?


STATISTICS IN PRACTICE


Inference About Means and Proportions with Two Populations 


U.S. Food and Drug Administration 
WASHINGTON, D.C. 


It is the responsibility of the U.S. Food and Drug Ad- 
ministration (FDA), through its Center for Drug Evalua- 
tion and Research (CDER), to ensure that drugs are safe 
and effective. But CDER does not do the actual testing 
of new drugs itself. It is the responsibility of the com- 
pany seeking to market a new drug to test it and submit 
evidence that it is safe and effective. CDER statisticians 
and scientists then review the evidence submitted. 

Companies seeking approval of a new drug con- 
duct extensive statistical studies to support their 
application. The testing process in the pharmaceutical 
industry usually consists of three stages: (1) preclinical 
testing, (2) testing for long-term usage and safety, and 
(3) clinical efficacy testing. At each successive stage, 
the chance that a drug will pass the rigorous tests 
decreases; however, the cost of further testing increases 
dramatically. Industry surveys indicate that on average 
the research and development for one new drug costs 
$250 million and takes 12 years. Hence, it is important 
to eliminate unsuccessful new drugs in the early stages 
of the testing process, as well as to identify promising 
ones for further testing. 

Statistics plays a major role in pharmaceutical 
research, where government regulations are stringent 
and rigorously enforced. In preclinical testing, a two- or 
three-population statistical study typically is used to 
determine whether a new drug should continue to be 
studied in the long-term usage and safety program. The 
populations may consist of the new drug, a control, and 
a standard drug. The preclinical testing process begins 
when a new drug is sent to the pharmacology group 
for evaluation of efficacy—the capacity of the drug to 
produce the desired effects. As part of the process, a 
statistician is asked to design an experiment that can 
be used to test the new drug. The design must specify 
the sample size and the statistical methods of analysis. 
In a two-population study, one sample is used to obtain 
data on the efficacy of the new drug (population 1) and 
a second sample is used to obtain data on the efficacy 
of a standard drug (population 2). Depending on the
intended use, the new and standard drugs are tested
in such disciplines as neurology, cardiology, and
immunology. In most studies, the statistical method involves
hypothesis testing for the difference between the means
of the new drug population and the standard drug
population. If a new drug lacks efficacy or produces
undesirable effects in comparison with the standard drug, the
new drug is rejected and withdrawn from further testing.
Only new drugs that show promising comparisons with
the standard drugs are forwarded to the long-term usage
and safety testing program.

Statistical methods are used to test and develop
new drugs. © Lisa S./ShutterStock.com

Further data collection and multipopulation studies
are conducted in the long-term usage and safety
testing program and in the clinical testing programs. The
FDA requires that statistical methods be defined prior
to such testing to avoid data-related biases. In
addition, to avoid human biases, some of the clinical trials
are double or triple blind. That is, neither the subject
nor the investigator knows what drug is administered
to whom. If the new drug meets all requirements in
relation to the standard drug, a new drug application (NDA)
is filed with the FDA. The application is rigorously
scrutinized by statisticians and scientists at the agency.

In this chapter you will learn how to construct interval
estimates and make hypothesis tests about means and
proportions with two populations. Techniques will be
presented for analyzing independent random samples
as well as matched samples.


In Chapters 8 and 9 we showed how to develop interval estimates and conduct hypothesis
tests for situations involving a single population mean and a single population proportion.

In this chapter, we extend the discussion of statistical inference beyond single sample
analyses of a population mean or population proportion by showing how interval estimates
and hypothesis tests can be developed for situations involving two populations when the
difference between the two population means or the two population proportions is of prime
importance. For example, we may want to develop an interval estimate of the difference
between the mean starting salary for a population of men and the mean starting salary for
a population of women or conduct a hypothesis test to determine whether any difference
is present between the proportion of defective parts in a population of parts produced
by supplier A and the proportion of defective parts in a population of parts produced by
supplier B. We begin our discussion of statistical inference about two populations by
showing how to develop interval estimates and conduct hypothesis tests about the
difference between the means of two populations when the standard deviations of the two
populations can be assumed known.


10.1 Inferences About the Difference Between
Two Population Means: $\sigma_1$ and $\sigma_2$ Known


Letting $\mu_1$ denote the mean of population 1 and $\mu_2$ denote the mean of population 2, we
will focus on inferences about the difference between the means: $\mu_1 - \mu_2$. To make an
inference about this difference, we select a random sample of $n_1$ units from population 1
and a second random sample of $n_2$ units from population 2. The two samples, taken
separately and independently, are referred to as independent random samples. In this section,
we assume that information is available such that the two population standard deviations,
$\sigma_1$ and $\sigma_2$, can be assumed known prior to collecting the samples. We refer to this situation
as the $\sigma_1$ and $\sigma_2$ known case. In the following example we show how to compute a margin
of error and develop an interval estimate of the difference between the two population
means when $\sigma_1$ and $\sigma_2$ are known.


Interval Estimation of $\mu_1 - \mu_2$


HomeStyle sells furniture at two stores in Buffalo, New York: One is in the inner city and 
the other is in a suburban shopping center. The regional manager noticed that products 
that sell well in one store do not always sell well in the other. The manager believes this 
situation may be attributable to differences in customer demographics at the two loca- 
tions. Customers may differ in age, education, income, and so on. Suppose the manager 
asks us to investigate the difference between the mean ages of the customers who shop at 
the two stores. 

Let us define population 1 as all customers who shop at the inner-city store and popula- 
tion 2 as all customers who shop at the suburban store. 


$\mu_1$ = mean of population 1 (i.e., the mean age of all customers
who shop at the inner-city store)

$\mu_2$ = mean of population 2 (i.e., the mean age of all customers
who shop at the suburban store)


The difference between the two population means is $\mu_1 - \mu_2$.

To estimate $\mu_1 - \mu_2$, we will select a random sample of $n_1$ customers from population 1
and a random sample of $n_2$ customers from population 2. We then compute the two sample
means.


$\bar{x}_1$ = sample mean age for the random sample of $n_1$ inner-city customers
$\bar{x}_2$ = sample mean age for the random sample of $n_2$ suburban customers


The point estimator of the difference between the two population means is the difference
between the two sample means.


POINT ESTIMATOR OF THE DIFFERENCE BETWEEN TWO POPULATION MEANS

$$\bar{x}_1 - \bar{x}_2 \qquad (10.1)$$


Figure 10.1 provides an overview of the process used to estimate the difference between
two population means based on two independent random samples.


FIGURE 10.1   Estimating the Difference Between Two Population Means

Population 1: Inner-City Store Customers
    $\mu_1$ = mean age of inner-city store customers
Population 2: Suburban Store Customers
    $\mu_2$ = mean age of suburban store customers
$\mu_1 - \mu_2$ = difference between the mean ages

Two Independent Random Samples
Random sample of $n_1$ inner-city customers
    $\bar{x}_1$ = sample mean age for the inner-city store customers
Random sample of $n_2$ suburban customers
    $\bar{x}_2$ = sample mean age for the suburban store customers
$\bar{x}_1 - \bar{x}_2$ = point estimator of $\mu_1 - \mu_2$


The standard error of $\bar{x}_1 - \bar{x}_2$ is the standard deviation of the sampling
distribution of $\bar{x}_1 - \bar{x}_2$.

The margin of error is given by multiplying the standard error by $z_{\alpha/2}$.


As with other point estimators, the point estimator $\bar{x}_1 - \bar{x}_2$ has a standard error that
describes the variation in the sampling distribution of the estimator. With two independent
random samples, the standard error of $\bar{x}_1 - \bar{x}_2$ is as follows.


STANDARD ERROR OF $\bar{x}_1 - \bar{x}_2$

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (10.2)$$


If both populations have a normal distribution, or if the sample sizes are large enough that
the central limit theorem enables us to conclude that the sampling distributions of $\bar{x}_1$ and $\bar{x}_2$
can be approximated by a normal distribution, the sampling distribution of $\bar{x}_1 - \bar{x}_2$ will have
a normal distribution with mean given by $\mu_1 - \mu_2$.
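
The form of the standard error in expression (10.2) can be motivated with a short sketch. It relies on two standard results that are assumed here rather than derived: the variance of a difference of independent random variables is the sum of their variances, and the variance of a sample mean is $\sigma^2/n$ when the population is large relative to the sample.

$$\mathrm{Var}(\bar{x}_1 - \bar{x}_2) = \mathrm{Var}(\bar{x}_1) + \mathrm{Var}(\bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

Taking the square root of this variance gives the standard error of $\bar{x}_1 - \bar{x}_2$. Note that the variances add even though the sample means are subtracted, which is why independence of the two samples matters.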

In general, an interval estimate is given by a point estimate ± a margin of error. In the
case of estimation of the difference between two population means, an interval estimate
will take the following form:

$$\bar{x}_1 - \bar{x}_2 \pm \text{Margin of error}$$

With the sampling distribution of $\bar{x}_1 - \bar{x}_2$ having a normal distribution, we can write the
margin of error as follows:

$$\text{Margin of error} = z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2} = z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (10.3)$$

Thus the interval estimate of the difference between two population means is as follows.


INTERVAL ESTIMATE OF THE DIFFERENCE BETWEEN TWO POPULATION MEANS:
$\sigma_1$ AND $\sigma_2$ KNOWN

$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad (10.4)$$


where $1 - \alpha$ is the confidence coefficient.

Let us return to the HomeStyle example. Based on data from previous customer
demographic studies, the two population standard deviations are known with $\sigma_1 = 9$ years and
$\sigma_2 = 10$ years. The data collected from the two independent random samples of HomeStyle
customers provided the following results.


DATA file: HomeStyle

                 Inner City Store          Suburban Store
Sample Size      $n_1 = 36$                $n_2 = 49$
Sample Mean      $\bar{x}_1 = 40$ years    $\bar{x}_2 = 35$ years


Using expression (10.1), we find that the point estimate of the difference between the
mean ages of the two populations is $\bar{x}_1 - \bar{x}_2 = 40 - 35 = 5$ years. Thus, we estimate that the
customers at the inner-city store have a mean age five years greater than the mean age of the
suburban store customers. We can now use expression (10.4) to compute the margin of error
and provide the interval estimate of $\mu_1 - \mu_2$. Using 95% confidence and $z_{\alpha/2} = z_{.025} = 1.96$,
we have

$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$$40 - 35 \pm 1.96\sqrt{\frac{9^2}{36} + \frac{10^2}{49}}$$

$$5 \pm 4.06$$


Thus, the margin of error is 4.06 years and the 95% confidence interval estimate of the
difference between the two population means is 5 - 4.06 = .94 years to 5 + 4.06 = 9.06 years.
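
The HomeStyle calculation can also be checked outside a worksheet. The following is a minimal Python sketch, not part of the text's Excel workflow; the function name `two_mean_z_interval` and its argument names are hypothetical, and it uses only the standard library's `statistics.NormalDist` to obtain the z value.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_interval(xbar1, xbar2, sigma1, sigma2, n1, n2, confidence=0.95):
    """Interval estimate of mu1 - mu2 when sigma1 and sigma2 are known (10.4).

    Returns (point_estimate, margin_of_error, lower_limit, upper_limit).
    The function name and signature are illustrative assumptions.
    """
    point_estimate = xbar1 - xbar2                           # expression (10.1)
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # expression (10.2)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)                  # z value with upper tail area alpha/2
    margin = z * standard_error                              # expression (10.3)
    return point_estimate, margin, point_estimate - margin, point_estimate + margin


# HomeStyle example: inner-city store (population 1), suburban store (population 2)
estimate, margin, lower, upper = two_mean_z_interval(40, 35, 9, 10, 36, 49)
print(f"{estimate} +/- {margin:.2f}: {lower:.2f} to {upper:.2f}")
```

Rounded to two decimal places, the margin of error and the interval limits agree with the hand calculation (4.06, .94, and 9.06). Passing a different `confidence` value changes only the z value.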


Using Excel to Construct a Confidence Interval 


Excel's data analysis tools do not provide a procedure for developing interval estimates
involving two population means. However, we can develop an Excel worksheet that can
be used as a template to construct interval estimates. We will illustrate by constructing
an interval estimate of the difference between the population means in the HomeStyle
Furniture Stores study. Refer to Figure 10.2 as we describe the tasks involved. The formula
worksheet is in the background; the value worksheet is in the foreground.


Enter/Access Data: Open the file HomeStyle. Column A contains the age data and a label 
for the random sample of 36 inner-city customers, and column B contains the age data and 
a label for the random sample of 49 suburban customers. 


Enter Functions and Formulas: The descriptive statistics needed are provided in cells E5:F6.
The known population standard deviations are entered into cells E8 and F8. Using the two
population standard deviations and the sample sizes, the standard error of the point estimator
$\bar{x}_1 - \bar{x}_2$ is computed using equation (10.2) by entering the following formula into cell E9:


=SQRT(E8^2/E5+F8^2/F5)


Cells E11:E14 are used to compute the appropriate z value and the margin of error. By
entering .95 into cell E11 for the value of the confidence coefficient, the corresponding level
of significance ($\alpha$ = 1 - confidence coefficient) is computed in cell E12. In cell E13, we
used the NORM.S.INV function to compute the z value needed for the interval estimate. The
margin of error is computed in cell E14 by multiplying the z value by the standard error.

In cell E16 the difference in the sample means is used to compute the point estimate of 
the difference in the two population means. The lower limit of the confidence interval is 
computed in cell E17 (.94) and the upper limit is computed in cell E18 (9.06); thus, the 95% 
confidence interval estimate of the difference in the two population means is .94 to 9.06.


FIGURE 10.2   Excel Worksheet: Constructing a 95% Confidence Interval
for HomeStyle Furniture Stores

Interval Estimate of Difference in Population Means: $\sigma_1$ and $\sigma_2$ Known Case

Cell(s)   Label                           Formula (Inner City; Suburban)         Value
E5:F5     Sample Size                     =COUNT(A2:A37); =COUNT(B2:B50)         36; 49
E6:F6     Sample Mean                     =AVERAGE(A2:A37); =AVERAGE(B2:B50)     40; 35
E8:F8     Population Standard Deviation   9; 10                                  9; 10
E9        Standard Error                  =SQRT(E8^2/E5+F8^2/F5)                 2.07
E11       Confidence Coefficient          0.95                                   0.95
E12       Level of Significance           =1-E11                                 0.05
E13       z Value                         =NORM.S.INV(1-E12/2)                   1.960
E14       Margin of Error                 =E13*E9                                4.06
E16       Point Estimate of Difference    =E6-F6                                 5
E17       Lower Limit                     =E16-E14                               0.94
E18       Upper Limit                     =E16+E14                               9.06

Note: Rows 19-35 and 38-48 are hidden.


A Template for Other Problems This worksheet can be used as a template for develop- 
ing interval estimates of the difference in population means when the population standard 
deviations are assumed known. For another problem of this type, we must first enter the 
new problem data in columns A and B. The data ranges in cells E5:F6 must be modified in 
order to compute the sample means and sample sizes for the new data. Also, the assumed 
known population standard deviations must be entered into cells E8 and F8. After doing so, 
the point estimate and a 95% confidence interval will be displayed in cells E16:E18. If a 
confidence interval with a different confidence coefficient is desired, we simply change the 
value in cell E11. 

We can further simplify the use of Figure 10.2 as a template for other problems by
eliminating the need to enter new data ranges in cells E5:F6. We rewrite the cell formulas
as follows:

Cell E5: =COUNT(A:A)
Cell F5: =COUNT(B:B)
Cell E6: =AVERAGE(A:A)
Cell F6: =AVERAGE(B:B)

Using the A:A method of specifying data ranges in cells E5 and E6, Excel's COUNT
function will count the number of numerical values in column A and Excel's AVERAGE
function will compute the average of the numerical values in column A. Similarly, using
the B:B method of specifying data ranges in cells F5 and F6, Excel's COUNT function will
count the number of numerical values in column B and Excel's AVERAGE function will
compute the average of the numerical values in column B. Thus, to solve a new problem it
is only necessary to enter the new data into columns A and B and enter the known
population standard deviations in cells E8 and F8.

The file HomeStyle includes a worksheet titled Template that uses the A:A and B:B
methods for entering the data ranges.
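
The same template idea, computing the sample sizes and sample means directly from whatever data are supplied, can be sketched in Python. This is an illustrative analogue of the =COUNT(A:A) and =AVERAGE(A:A) formulas rather than a translation of the worksheet: the function name `interval_from_samples` is hypothetical, the lists are assumed to contain only numbers (Excel's COUNT and AVERAGE skip non-numeric cells, whereas `len` and `fmean` do not), and the ages shown are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def interval_from_samples(sample1, sample2, sigma1, sigma2, confidence=0.95):
    """Interval estimate of mu1 - mu2 from raw data, with sigma1 and sigma2 known.

    Hypothetical helper: sample sizes and means are computed from the data,
    so no ranges need to be edited when the data change.
    """
    n1, n2 = len(sample1), len(sample2)              # plays the role of =COUNT(A:A), =COUNT(B:B)
    xbar1, xbar2 = fmean(sample1), fmean(sample2)    # plays the role of =AVERAGE(A:A), =AVERAGE(B:B)
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    estimate = xbar1 - xbar2
    return estimate - margin, estimate + margin


# Made-up ages for illustration only; these are NOT the HomeStyle data.
inner_city = [38, 45, 41, 36, 44, 39, 42, 35]
suburban = [33, 37, 31, 40, 34, 36]
lower, upper = interval_from_samples(inner_city, suburban, sigma1=9, sigma2=10)
print(f"{lower:.2f} to {upper:.2f}")
```

As with the worksheet template, only the data and the two known population standard deviations need to be supplied for a new problem.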

This worksheet can also be used as a template for text exercises in which the sample
sizes, sample means, and population standard deviations are given. In this type of
situation, no change in the data is necessary. We simply replace the values in cells E5:F6 and
E8:F8 with the given values of the sample sizes, sample means, and population standard
deviations. If something other than a 95% confidence interval is desired, the confidence
coefficient in cell E11 must also be changed.


Hypothesis Tests About $\mu_1 - \mu_2$


Let us consider hypothesis tests about the difference between two population means. Using
$D_0$ to denote the hypothesized difference between $\mu_1$ and $\mu_2$, the three forms for a
hypothesis test are as follows:

$$\begin{array}{lll}
H_0: \mu_1 - \mu_2 \ge D_0 & \qquad H_0: \mu_1 - \mu_2 \le D_0 & \qquad H_0: \mu_1 - \mu_2 = D_0 \\
H_a: \mu_1 - \mu_2 < D_0 & \qquad H_a: \mu_1 - \mu_2 > D_0 & \qquad H_a: \mu_1 - \mu_2 \ne D_0
\end{array}$$

In many applications, $D_0 = 0$. Using the two-tailed test as an example, when $D_0 = 0$ the
null hypothesis is $H_0$: $\mu_1 - \mu_2 = 0$. In this case, the null hypothesis is that $\mu_1$ and $\mu_2$ are
equal. Rejection of $H_0$ leads to the conclusion that $H_a$: $\mu_1 - \mu_2 \ne 0$ is true; that is, $\mu_1$ and
$\mu_2$ are not equal.

Chapter 9 introduces the general steps for hypothesis testing for a single population
mean and single population proportion.

The general steps for conducting hypothesis tests are still applicable here. We must
choose a level of significance, compute the value of the test statistic, and find the p-value
to determine whether the null hypothesis should be rejected. With two independent random
samples, we showed that the point estimator $\bar{x}_1 - \bar{x}_2$ has a standard error $\sigma_{\bar{x}_1 - \bar{x}_2}$ given by
expression (10.2) and, when the sample sizes are large enough, the distribution of $\bar{x}_1 - \bar{x}_2$
can be described by a normal distribution. In this case, the test statistic for the difference
between two population means when $\sigma_1$ and $\sigma_2$ are known is as follows.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT $\mu_1 - \mu_2$: $\sigma_1$ AND $\sigma_2$ KNOWN

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad (10.5)$$

Let us demonstrate the use of this test statistic in the following hypothesis testing example. 

As part of a study to evaluate differences in education quality between two training 
centers, a standardized examination is given to individuals who are trained at the centers. 
The difference between the mean examination scores is used to assess quality differences 
between the centers. The population means for the two centers are as follows. 


$\mu_1$ = the mean examination score for the population
of individuals trained at center A

$\mu_2$ = the mean examination score for the population
of individuals trained at center B


We begin with the tentative assumption that no difference exists between the training
quality provided at the two centers. Hence, in terms of the mean examination scores, the
null hypothesis is that $\mu_1 - \mu_2 = 0$. If sample evidence leads to the rejection of this
hypothesis, we will conclude that the mean examination scores differ for the two populations.
This conclusion indicates a quality differential between the two centers and suggests that a
follow-up study investigating the reason for the differential may be warranted. The null and
alternative hypotheses for this two-tailed test are written as follows.

$$H_0: \mu_1 - \mu_2 = 0$$
$$H_a: \mu_1 - \mu_2 \ne 0$$

The standardized examination given previously in a variety of settings always resulted in
an examination score standard deviation near 10 points. Thus, we will use this information
to assume that the population standard deviations are known with $\sigma_1 = 10$ and $\sigma_2 = 10$. An
$\alpha = .05$ level of significance is specified for the study.

DATA file: ExamScores

Independent random samples of $n_1 = 30$ individuals from training center A and $n_2 = 40$
individuals from training center B are taken. The respective sample means are $\bar{x}_1 = 82$ and
$\bar{x}_2 = 78$. Do these data suggest a significant difference between the population means at
the two training centers? To help answer this question, we compute the test statistic using
equation (10.5).

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(82 - 78) - 0}{\sqrt{\dfrac{10^2}{30} + \dfrac{10^2}{40}}} = 1.66$$


Next let us compute the p-value for this two-tailed test. Because the test statistic z is in
the upper tail, we first compute the upper tail area corresponding to z = 1.66. Using the
standard normal distribution table, the area to the left of z = 1.66 is .9515. Thus, the area
in the upper tail of the distribution is 1.0000 - .9515 = .0485. Because this test is a
two-tailed test, we must double the tail area: p-value = 2(.0485) = .0970. Following the usual
rule to reject $H_0$ if p-value $\le \alpha$, we see that the p-value of .0970 does not allow us to reject
$H_0$ at the .05 level of significance. The sample results do not provide sufficient evidence to
conclude that the training centers differ in quality.
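
The training-center test can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the text: the function name `two_mean_z_test` is hypothetical, and the tail area comes from `statistics.NormalDist` rather than the standard normal table.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_test(xbar1, xbar2, sigma1, sigma2, n1, n2, d0=0.0):
    """Test statistic (10.5) and two-tailed p-value for H0: mu1 - mu2 = d0.

    Hypothetical helper; sigma1 and sigma2 are assumed known.
    """
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # expression (10.2)
    z = ((xbar1 - xbar2) - d0) / standard_error              # expression (10.5)
    tail_area = 1 - NormalDist().cdf(abs(z))                 # area beyond |z| in one tail
    return z, 2 * tail_area                                  # double the tail area for a two-tailed test


# Training-center example: center A (population 1), center B (population 2)
z, p_value = two_mean_z_test(82, 78, 10, 10, 30, 40)
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "do not reject H0"
print(f"z = {z:.4f}, p-value = {p_value:.4f}: {decision}")
```

Because the sketch does not round the test statistic to 1.66 before finding the tail area, its p-value differs slightly from the .0970 obtained with the table. The conclusion at the .05 level of significance is unchanged.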

In this chapter we will use the p-value approach to hypothesis testing. However, if you
prefer, the test statistic and the critical value rejection rule may be used. With $\alpha = .05$ and
$z_{\alpha/2} = z_{.025} = 1.96$, the rejection rule employing the critical value approach would be reject $H_0$
if $z \le -1.96$ or if $z \ge 1.96$. With $z = 1.66$, we reach the same do not reject $H_0$ conclusion.

In the preceding example, we demonstrated a two-tailed hypothesis test about the
difference between two population means. Lower tail and upper tail tests can also be considered.
These tests use the same test statistic as given in equation (10.5). The procedure for
computing the p-value and the rejection rules for these one-tailed tests are the same as those for
hypothesis tests involving a single population mean and single population proportion.
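
To make the three test forms concrete, the sketch below computes the p-value and applies the critical value rejection rule for lower tail, upper tail, and two-tailed tests, given a value of the z test statistic. The helper names `p_value` and `reject_by_critical_value` and the `tail` labels are hypothetical choices for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_value(z, tail):
    """p-value for a z test statistic; tail is 'lower', 'upper', or 'two'."""
    if tail == "lower":
        return STANDARD_NORMAL.cdf(z)                   # area to the left of z
    if tail == "upper":
        return 1 - STANDARD_NORMAL.cdf(z)               # area to the right of z
    if tail == "two":
        return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))    # double the tail area beyond |z|
    raise ValueError("tail must be 'lower', 'upper', or 'two'")


def reject_by_critical_value(z, alpha, tail):
    """Critical value rejection rule for the same three test forms."""
    if tail == "lower":
        return z <= STANDARD_NORMAL.inv_cdf(alpha)              # reject if z <= -z_alpha
    if tail == "upper":
        return z >= STANDARD_NORMAL.inv_cdf(1 - alpha)          # reject if z >= z_alpha
    if tail == "two":
        return abs(z) >= STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # reject if |z| >= z_{alpha/2}
    raise ValueError("tail must be 'lower', 'upper', or 'two'")


z = 1.66  # rounded test statistic from the training-center example
for tail in ("lower", "upper", "two"):
    print(tail, round(p_value(z, tail), 4), reject_by_critical_value(z, 0.05, tail))
```

For any given form of test, the two approaches lead to the same decision: the p-value is less than or equal to $\alpha$ exactly when the critical value rule rejects. The training-center study itself is a two-tailed test; the one-tailed rows are shown only to illustrate the calculation.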


Using Excel to Conduct a Hypothesis Test 


Excel's z-Test: Two Sample for Means tool can be used to conduct the hypothesis test to
determine whether there is a significant difference in population means when $\sigma_1$ and $\sigma_2$
are assumed known. We illustrate using the sample data for exam scores at center A and at
center B. With an assumed known standard deviation of 10 points at each center, the known
variance of exam scores for each of the two populations is equal to $10^2 = 100$. Refer to the
Excel worksheets shown in Figure 10.3 and Figure 10.4 as we describe the tasks involved.


Enter/Access Data: Open the file ExamScores. Column A in Figure 10.3 contains the ex- 
amination score data and a label for the random sample of 30 individuals trained at center 
A, and column B contains the examination score data and a label for the random sample of 
40 individuals trained at center B. 


Apply Tools: The following steps will provide the information needed to conduct the hy- 
pothesis test to see whether there is a significant difference in test scores at the two centers. 


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. Choose z-Test: Two Sample for Means from the list of Analysis Tools
Step 4. When the z-Test: Two Sample for Means dialog box appears (Figure 10.3):
        Enter A1:A31 in the Variable 1 Range: box
        Enter B1:B41 in the Variable 2 Range: box
        Enter 0 in the Hypothesized Mean Difference: box
        Enter 100 in the Variable 1 Variance (known): box
        Enter 100 in the Variable 2 Variance (known): box
        Select the check box for Labels
        Enter .05 in the Alpha: box
        Select Output Range: and enter D4 in the box
        Click OK


FIGURE 10.3   Dialog Box for Excel's z-Test: Two Sample for Means Tool

Input
    Variable 1 Range: $A$1:$A$31
    Variable 2 Range: $B$1:$B$41
    Hypothesized Mean Difference: 0
    Variable 1 Variance (known): 100
    Variable 2 Variance (known): 100
    Labels: selected
    Alpha: 0.05
Output options
    Output Range: D4 (selected)
    New Worksheet Ply:
    New Workbook

Note: Rows 18-28 and 33-39 are hidden.


FIGURE 10.4   Excel Results for the Hypothesis Test About Equality of Exam Scores
at Two Training Centers

[Worksheet image: Center A and Center B exam score data with the z-Test: Two Sample for Means output]

Note: Rows 18-28 and 33-39 are hidden.


The results are shown in Figure 10.4. Descriptive statistics for the two samples are shown
in cells E7:F9. The value of the test statistic, 1.6562, is shown in cell E11. The p-value for
the test, labeled "P(Z<=z) two-tail", is shown in cell E14. Because the p-value, .0977, is
greater than the level of significance, $\alpha = .05$, we cannot conclude that the means for the
two populations are different.

The value of the test statistic shown here (1.6562) and the p-value (.0977) differ
slightly from those shown previously, because we rounded the test statistic to two
places (1.66) in the text.

The z-Test: Two Sample for Means tool can also be used to conduct one-tailed
hypothesis tests. The only change required to make the hypothesis testing decision is that we need
to use the p-value for a one-tailed test, labeled "P(Z<=z) one-tail" (see cell E12).


Practical Advice 


In most applications of the interval estimation and hypothesis testing procedures presented 
in this section, random samples with $n_1 \ge 30$ and $n_2 \ge 30$ are adequate. In cases where
either or both sample sizes are less than 30, the distributions of the populations become 
important considerations. In general, with smaller sample sizes, it is more important for 
the analyst to be satisfied that it is reasonable to assume that the distributions of the two 
populations are at least approximately normal. 


Methods 
1. The following results come from two independent random samples taken of two 
populations. 


Sample 1                 Sample 2
$n_1 = 50$               $n_2 = 35$
$\bar{x}_1 =$ …          $\bar{x}_2 = 116$
$\sigma_1 = 2.2$         $\sigma_2 = 3.0$


a. What is the point estimate of the difference between the two population means? 
b. Provide a 90% confidence interval for the difference between the two population 
means. 
c. Provide a 95% confidence interval for the difference between the two population 
means. 
2. Consider the following hypothesis test. 


$$H_0: \mu_1 - \mu_2 \le 0$$
$$H_a: \mu_1 - \mu_2 > 0$$


The following results are for two independent samples taken from the two populations.


Sample 1                 Sample 2
$n_1 = 40$               $n_2 = 50$
$\bar{x}_1 = 25.2$       $\bar{x}_2 = 22.8$
$\sigma_1 = 5.2$         $\sigma_2 = 6.0$


a. What is the value of the test statistic?
b. What is the p-value?
c. With $\alpha = .05$, what is your hypothesis testing conclusion?
3. Consider the following hypothesis test. 


$$H_0: \mu_1 - \mu_2 = 0$$
$$H_a: \mu_1 - \mu_2 \ne 0$$

The following results are for two independent samples taken from the two populations.


Sample 1                 Sample 2
$n_1 = 80$               $n_2 = 70$
$\bar{x}_1 = 104$        $\bar{x}_2 = 106$
$\sigma_1 = 8.4$         $\sigma_2 =$ …


a. What is the value of the test statistic?
b. What is the p-value?
c. With $\alpha = .05$, what is your hypothesis testing conclusion?


Applications 

4. Cruise Ship Ratings. Condé Nast Traveler conducts an annual survey in which readers 
rate their favorite cruise ship. All ships are rated on a 100-point scale, with higher values 
indicating better service. A sample of 37 ships that carry fewer than 500 passengers 
resulted in an average rating of 85.36, and a sample of 44 ships that carry 500 or more 
passengers provided an average rating of 81.40. Assume that the population standard 
deviation is 4.55 for ships that carry fewer than 500 passengers and 3.97 for ships that 
carry 500 or more passengers. 

a. What is the point estimate of the difference between the population mean rating for 
ships that carry fewer than 500 passengers and the population mean rating for ships 
that carry 500 or more passengers? 

b. At 95% confidence, what is the margin of error? 

c. What is a 95% confidence interval estimate of the difference between the popula- 
tion mean ratings for the two sizes of ships? 

5. Valentine’s Day Expenditures. USA Today reports that the average expenditure on 
Valentine’s Day is $100.89. Do male and female consumers differ in the amounts 
they spend? The average expenditure in a sample survey of 40 male consumers was 
$135.67, and the average expenditure in a sample survey of 30 female consumers 
was $68.64. Based on past surveys, the standard deviation for male consumers is 
assumed to be $35, and the standard deviation for female consumers is assumed to 
be $20. 

a. What is the point estimate of the difference between the population mean expendi- 
ture for males and the population mean expenditure for females? 

b. At 99% confidence, what is the margin of error? 

c. Develop a 99% confidence interval for the difference between the two population 
means. 

6. Hotel Price Comparison. Suppose that you are responsible for making
arrangements for a business convention and that you have been charged with choosing a city
for the convention that has the least expensive hotel rooms. You have narrowed your
choices to Atlanta and Houston. The file Hotel contains samples of prices for rooms
in Atlanta and Houston that are consistent with a SmartMoney survey conducted by
Smith Travel Research. Because considerable historical data on the prices of rooms
in both cities are available, the population standard deviations for the prices can be
assumed to be $20 in Atlanta and $25 in Houston. Based on the sample data, can
you conclude that the mean price of a hotel room in Atlanta is lower than one in
Houston?

DATA file: Hotel

7. Supermarket Customer Satisfaction. Consumer Reports uses a survey of readers to
obtain customer satisfaction ratings for the nation's largest supermarkets (Consumer
Reports website). Each survey respondent is asked to rate a specified supermarket
based on a variety of factors such as quality of products, selection, value, checkout
efficiency, service, and store layout. An overall satisfaction score summarizes the rating
for each respondent with 100 meaning the respondent is completely satisfied in terms
of all factors. Sample data representative of independent samples of Publix and Trader
Joe's customers are shown below.


Publix                   Trader Joe's
$n_1 = 250$              $n_2 = 300$
$\bar{x}_1 = 86$         $\bar{x}_2 = 85$

a. Formulate the null and alternative hypotheses to test whether there is a difference 
between the population mean customer satisfaction scores for the two retailers. 

b. Assume that experience with the Consumer Reports satisfaction rating scale 
indicates that a population standard deviation of 12 is a reasonable assumption for 
both retailers. Conduct the hypothesis test and report the p-value. At a .05 level of 
significance what is your conclusion? 

c. Which retailer, if either, appears to have the greater customer satisfaction? Provide 
a 95% confidence interval for the difference between the population mean customer 
satisfaction scores for the two retailers. 

8. Increases in Customer Satisfaction. Will improving customer service result in higher 
stock prices for the companies providing the better service? “When a company’s 
satisfaction score has improved over the prior year’s results and is above the national 
average (75.7), studies show its shares have a good chance of outperforming the broad 
stock market in the long run.” The following satisfaction scores of three companies for 
the fourth quarters of two previous years were obtained from the American Customer 
Satisfaction Index. Assume that the scores are based on a poll of 60 customers from 
each company. Because the polling has been done for several years, the standard devi- 
ation can be assumed to equal 6 points in each case. 


Company          Year 1 Score    Year 2 Score
Rite Aid         73              76
Expedia          75              77
J.C. Penney      …               78


a. For Rite Aid, is the increase in the satisfaction score from Year 1 to Year 2
statistically significant? Use $\alpha = .05$. What can you conclude?

b. Can you conclude that the Year 2 score for Rite Aid is above the national average of
75.7? Use $\alpha = .05$.

c. For Expedia, is the increase from Year 1 to Year 2 statistically significant? Use
$\alpha = .05$.

d. When conducting a hypothesis test with the values given for the standard deviation,
sample size, and $\alpha$, how large must the increase from Year 1 to Year 2 be for it to be
statistically significant?

e. Use the result of part (d) to state whether the increase for J.C. Penney from Year 1
to Year 2 is statistically significant.


10.2 Inferences About the Difference Between 
Two Population Means: σ₁ and σ₂ Unknown 


In this section we extend the discussion of inferences about the difference between two
population means to the case when the two population standard deviations, σ₁ and σ₂, are
unknown. In this case, we will use the sample standard deviations, s₁ and s₂, to estimate
the unknown population standard deviations. When we use the sample standard deviations,
the interval estimation and hypothesis testing procedures will be based on the t distribution
rather than the standard normal distribution.


Interval Estimation of μ₁ − μ₂ 


In the following example we show how to compute a margin of error and develop an interval
estimate of the difference between two population means when σ₁ and σ₂ are unknown.
Clearwater National Bank is conducting a study designed to identify differences between 
checking account practices by customers at two of its branch banks. A random sample of 
28 checking accounts is selected from the Cherry Grove Branch and an independent ran- 
dom sample of 22 checking accounts is selected from the Beechmont Branch. The current 
checking account balance is recorded for each of the checking accounts. A summary of the 
account balances follows: 


                              Cherry Grove    Beechmont

Sample Size                   n₁ = 28         n₂ = 22
Sample Mean                   x̄₁ = $1025      x̄₂ = $910
Sample Standard Deviation     s₁ = $150       s₂ = $125


Clearwater National Bank would like to estimate the difference between the mean 
checking account balance maintained by the population of Cherry Grove customers and 
the population of Beechmont customers. Let us develop the margin of error and an interval 
estimate of the difference between these two population means. 

In Section 10.1, we provided the following interval estimate for the case when the
population standard deviations, σ₁ and σ₂, are known.

\[
\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
\]

When σ₁ and σ₂ are estimated by s₁ and s₂, the t distribution is used to make inferences
about the difference between two population means.

With σ₁ and σ₂ unknown, we will use the sample standard deviations s₁ and s₂ to estimate
σ₁ and σ₂ and replace z_{α/2} with t_{α/2}. As a result, the interval estimate of the
difference between two population means is given by the following expression.

INTERVAL ESTIMATE OF THE DIFFERENCE BETWEEN TWO POPULATION MEANS:
σ₁ AND σ₂ UNKNOWN

\[
\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{10.6}
\]

where 1 − α is the confidence coefficient.


In this expression, the use of the t distribution is an approximation, but it provides excellent
results and is relatively easy to use. The only difficulty that we encounter in using expression
(10.6) is determining the appropriate degrees of freedom for t_{α/2}. Statistical software packages
compute the appropriate degrees of freedom automatically. The formula used is as follows.


DEGREES OF FREEDOM: t DISTRIBUTION WITH TWO INDEPENDENT
RANDOM SAMPLES

\[
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
          {\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2} \tag{10.7}
\]


Let us return to the Clearwater National Bank example and show how to use expression
(10.6) to provide a 95% confidence interval estimate of the difference between the
population mean checking account balances at the two branch banks. The sample data show
n₁ = 28, x̄₁ = $1025, and s₁ = $150 for the Cherry Grove branch, and n₂ = 22, x̄₂ = $910,
and s₂ = $125 for the Beechmont branch. The calculation for degrees of freedom for t_{α/2} is
as follows:

\[
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
          {\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}
   = \frac{\left(\dfrac{150^2}{28} + \dfrac{125^2}{22}\right)^2}
          {\dfrac{1}{28 - 1}\left(\dfrac{150^2}{28}\right)^2 + \dfrac{1}{22 - 1}\left(\dfrac{125^2}{22}\right)^2}
   = 47.8
\]

We round the noninteger degrees of freedom down to 47 to provide a larger t value and
a more conservative interval estimate. Using the t distribution table with 47 degrees of
freedom, we find t_.025 = 2.012. Using expression (10.6), we develop the 95% confidence
interval estimate of the difference between the two population means as follows.

\[
\begin{aligned}
&\bar{x}_1 - \bar{x}_2 \pm t_{.025} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \\
&1025 - 910 \pm 2.012 \sqrt{\frac{150^2}{28} + \frac{125^2}{22}} \\
&115 \pm 78
\end{aligned}
\]


The point estimate of the difference between the population mean checking account 
balances at the two branches is $115. The margin of error is $78, and the 95% confidence 
interval estimate of the difference between the two population means is 115 − 78 = $37 to 
115 + 78 = $193. 
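
The Clearwater calculation can also be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the Excel workflow used in this text; the function names `welch_df` and `interval_estimate` are illustrative assumptions, and the t value 2.012 is simply the table value quoted above rather than something the code computes.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Degrees of freedom from equation (10.7)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def interval_estimate(xbar1, s1, n1, xbar2, s2, n2, t_value):
    """Point estimate, margin of error, and limits from expression (10.6)."""
    point = xbar1 - xbar2
    std_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
    margin = t_value * std_error
    return point, margin, point - margin, point + margin

# Clearwater National Bank summary statistics from the text.
df = welch_df(150, 28, 125, 22)
T_025_47_DF = 2.012  # t table value quoted in the text for 47 degrees of freedom
point, margin, lower, upper = interval_estimate(1025, 150, 28, 910, 125, 22, T_025_47_DF)

print(f"df = {df:.1f} (rounded down to {math.floor(df)})")
print(f"{point} +/- {margin:.0f}, or {lower:.0f} to {upper:.0f}")
```

Rounding the degrees of freedom down before looking up the t value mirrors the conservative choice described above.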


Using Excel to Construct a Confidence Interval 


Excel’s data analysis tools do not provide a procedure for developing interval estimates 
involving two population means. However, we can develop an Excel worksheet that can 
be used as a template to construct interval estimates. We will illustrate by constructing 
an interval estimate of the difference between the population means in the Clearwater 
National Bank study. Refer to Figure 10.5 as we describe the tasks involved. The formula 
worksheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file CheckAcct. Column A contains the account balances and
a label for the random sample of 28 customers at the Cherry Grove Branch, and column
B contains the account balances and a label for the random sample of 22 customers at the
Beechmont Branch.


Enter Functions and Formulas: The descriptive statistics needed are provided in cells E5:F7. 
Using the two sample standard deviations and the sample sizes, an estimate of the variance of 
the point estimator x, — x, is computed by entering the following formula into cell E9: 


=E7^2/E5 + F7^2/F5 


An estimate of the standard error is then computed in cell E10 by taking the square root of 
the variance. 

Cells E12:E16 are used to compute the appropriate t value and the margin of error. By
entering .95 into cell E12 for the value of the confidence coefficient, the corresponding
level of significance is computed in cell E13 (α = .05). In cell E14, we used formula (10.7)
to compute the degrees of freedom (47.8). In cell E15, we used the T.INV.2T function to
compute the t value needed for the interval estimate. The margin of error is computed in
cell E16 by multiplying the t value by the standard error.

FIGURE 10.5 | Excel Worksheet: Constructing a 95% Confidence Interval 
for Clearwater National Bank 

In cell E18 the difference in the sample means is used to compute the point estimate 
of the difference in the two population means (115). The lower limit of the confidence 
interval is computed in cell E19 (37) and the upper limit is computed in cell E20 (193); 
thus, the 95% confidence interval estimate of the difference in the two population means is 
37 to 193. 


A Template for Other Problems This worksheet can be used as a template for developing 
interval estimates of the difference in population means when the population standard devi- 
ations are unknown. For another problem of this type, we must first enter the new problem 
data in columns A and B. The data ranges in cells E5:F7 must be modified in order to 
compute the sample means, sample sizes, and sample standard deviations for the new data. 
After doing so, the point estimate and a 95% confidence interval will be displayed in cells 
E18:E20. If a confidence interval with a different confidence coefficient is desired, we 
simply change the value in cell E12. 

We can further simplify the use of Figure 10.5 as a template for other problems by
eliminating the need to enter new data ranges in cells E5:F7. We rewrite the cell formulas
as follows:

Cell E5: =COUNT(A:A)
Cell F5: =COUNT(B:B)
Cell E6: =AVERAGE(A:A)
Cell F6: =AVERAGE(B:B)
Cell E7: =STDEV.S(A:A)
Cell F7: =STDEV.S(B:B)

The file CheckAcct includes a worksheet titled Template that uses the A:A and B:B
methods for entering the data ranges.

Using the A:A method of specifying data ranges in cells E5:E7, Excel’s COUNT
function will count the number of numeric values in column A, Excel’s AVERAGE
function will compute the average of the numeric values in column A, and Excel’s
STDEV.S function will compute the standard deviation of the numeric values in column A.
Similarly, using the B:B method of specifying data ranges in cells F5:F7, Excel’s 
COUNT function will count the number of numeric values in column B, Excel’s 
AVERAGE function will compute the average of the numeric values in column B, and 
Excel’s STDEV.S function will compute the standard deviation of the numeric values in 
column B. Thus, to solve a new problem it is only necessary to enter the new data into 
columns A and B. 
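
The whole-column idea can be mimicked outside Excel as well. The sketch below is a hypothetical Python analogue, not the Template worksheet itself: the helper name `column_summary` and the small columns of balances are made up for illustration. It skips labels and blanks the way COUNT, AVERAGE, and STDEV.S skip nonnumeric cells.

```python
from statistics import mean, stdev

def column_summary(cells):
    """Return (count, mean, sample standard deviation) of the numeric cells."""
    # Skip text labels and blanks; bool is excluded because Python treats
    # True/False as integers.
    values = [c for c in cells
              if isinstance(c, (int, float)) and not isinstance(c, bool)]
    return len(values), mean(values), stdev(values)

# Hypothetical worksheet columns: a text label, some balances, and a blank cell.
column_a = ["Cherry Grove", 1263, 897, 849, None, 1122]
column_b = ["Beechmont", 1019, 718, 1010]

n1, xbar1, s1 = column_summary(column_a)
n2, xbar2, s2 = column_summary(column_b)
print(n1, round(xbar1, 2), round(s1, 2))
print(n2, round(xbar2, 2), round(s2, 2))
```

Because the summary is recomputed from whatever is in the columns, new data can be dropped in without editing any ranges, which is the point of the A:A and B:B formulas.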

This worksheet can also be used as a template for text exercises in which the sample 
sizes, sample means, and sample standard deviations are given. In this type of situation, no 
change in the data is necessary. We simply replace the values in cells E5:F7 with the given 
values of the sample sizes, sample means, and sample standard deviations. If something 
other than a 95% confidence interval is desired, the confidence coefficient in cell E12 must 
also be changed. 


Hypothesis Tests About μ₁ − μ₂ 


Let us now consider hypothesis tests about the difference between the means of two
populations when the population standard deviations σ₁ and σ₂ are unknown. Letting D₀ denote
the hypothesized difference between μ₁ and μ₂, Section 10.1 showed that the test statistic
used for the case where σ₁ and σ₂ are known is as follows.

\[
z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
\]

The test statistic, z, follows the standard normal distribution.

When σ₁ and σ₂ are unknown, we use s₁ as an estimator of σ₁ and s₂ as an estimator of
σ₂. Substituting these sample standard deviations for σ₁ and σ₂ provides the following test
statistic when σ₁ and σ₂ are unknown.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT μ₁ − μ₂: σ₁ AND σ₂ UNKNOWN

\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{10.8}
\]


The degrees of freedom for t are given by equation (10.7). 


Let us demonstrate the use of this test statistic in the following hypothesis testing example. 

Consider a new computer software package developed to help systems analysts reduce 
the time required to design, develop, and implement an information system. To evaluate the 
benefits of the new software package, a random sample of 24 systems analysts is selected. 
Each analyst is given specifications for a hypothetical information system. Then 12 of the 
analysts are instructed to produce the information system by using current technology. The 
other 12 analysts are trained in the use of the new software package and then instructed to 
use it to produce the information system. 

This study involves two populations: a population of systems analysts using the current 
technology and a population of systems analysts using the new software package. In terms 
of the time required to complete the information system design project, the population 
means are as follows. 


TABLE 10.1 Completion Time Data and Summary Statistics for the Software
Testing Study

                              Current Technology    New Software

                              300                   274
                              280                   220
                              344                   308
                              385                   336
                              372                   198
                              360                   300
                              288                   315
                              321                   258
                              376                   318
                              290                   310
                              301                   332
                              283                   263

Summary Statistics

Sample size                   n₁ = 12               n₂ = 12
Sample mean                   x̄₁ = 325 hours        x̄₂ = 286 hours
Sample standard deviation     s₁ = 40               s₂ = 44


μ₁ = the mean project completion time for systems analysts
using the current technology

μ₂ = the mean project completion time for systems analysts
using the new software package


The researcher in charge of the new software evaluation project hopes to show that
the new software package will provide a shorter mean project completion time. Thus,
the researcher is looking for evidence to conclude that μ₂ is less than μ₁; in this case, the
difference between the two population means, μ₁ − μ₂, will be greater than zero. The
research hypothesis μ₁ − μ₂ > 0 is stated as the alternative hypothesis. Thus, the
hypothesis test becomes

H₀: μ₁ − μ₂ ≤ 0
Hₐ: μ₁ − μ₂ > 0

We will use α = .05 as the level of significance.

Suppose that the 24 analysts complete the study with the results shown in Table 10.1.
Using the test statistic in equation (10.8), we have

\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
  = \frac{(325 - 286) - 0}{\sqrt{\dfrac{40^2}{12} + \dfrac{44^2}{12}}}
  = 2.27
\]

Computing the degrees of freedom using equation (10.7), we have

\[
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
          {\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}
   = \frac{\left(\dfrac{40^2}{12} + \dfrac{44^2}{12}\right)^2}
          {\dfrac{1}{12 - 1}\left(\dfrac{40^2}{12}\right)^2 + \dfrac{1}{12 - 1}\left(\dfrac{44^2}{12}\right)^2}
   = 21.8
\]



Rounding down, we will use a t distribution with 21 degrees of freedom. This row of the
t distribution table is as follows:

Area in Upper Tail | .20    .10    .05    .025   .01    .005
t-Value (21 df)    | 0.859  1.323  1.721  2.080  2.518  2.831
                                            t = 2.27

With an upper tail test, the p-value is the area in the upper tail to the right of t = 2.27. From
the above results, we see that the p-value is between .025 and .01. Thus, the p-value is less
than α = .05 and H₀ is rejected. The sample results enable the researcher to conclude that
μ₁ − μ₂ > 0, or μ₁ > μ₂. Thus, the research study supports the conclusion that the new
software package provides a smaller population mean completion time.

Using the t distribution table, we can only determine a range for the p-value. Use of Excel
(see Figure 10.7) shows the exact p-value = .017.
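
For readers who want to check the software-study test without a t table, here is a rough Python sketch using only the standard library. It is an illustration under stated assumptions, not the procedure used in this text: the helper names `welch_t` and `t_upper_tail` are made up, the data values are those consistent with the reported sample means of 325 and 286 hours, and the p-value is obtained by numerically integrating the t density with Simpson's rule instead of calling a statistical package.

```python
import math
from statistics import mean, stdev

# Completion times (hours); values consistent with the reported means 325 and 286.
current = [300, 280, 344, 385, 372, 360, 288, 321, 376, 290, 301, 283]
new_software = [274, 220, 308, 336, 198, 300, 315, 258, 318, 310, 332, 263]

def welch_t(sample1, sample2, d0=0.0):
    """Test statistic (10.8) and degrees of freedom (10.7) from raw data."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = stdev(sample1) ** 2 / n1, stdev(sample2) ** 2 / n2
    t = (mean(sample1) - mean(sample2) - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t (t >= 0) under the t density; steps must be even."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    # Simpson's rule on [0, t]; the density is symmetric, so the area to the
    # right of t is 0.5 minus the area between 0 and t.
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

t, df = welch_t(current, new_software)
p_value = t_upper_tail(t, df)
print(f"t = {t:.4f}, df = {df:.1f}, upper-tail p-value = {p_value:.4f}")
```

The sketch uses the unrounded degrees of freedom, so its p-value can differ slightly in the fourth decimal place from software that rounds the degrees of freedom to a whole number; either way it falls between .01 and .025, in agreement with the table.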


Using Excel to Conduct a Hypothesis Test 


Excel’s t-Test: Two-Sample Assuming Unequal Variances tool can be used to conduct a
hypothesis test to determine whether there is a significant difference in population means
when the population standard deviations are unknown. We illustrate using the sample
data for the software evaluation study. Twelve systems analysts developed an information
system using current technology, and 12 systems analysts developed an information
system using a new software package. A one-tailed hypothesis test is to be conducted to see
whether the mean completion time is shorter using the new software package. Refer to the
Excel worksheets shown in Figure 10.6 and Figure 10.7 as we describe the tasks involved.


Enter/Access Data: Open the file SoftwareTest. Column A in Figure 10.6 contains the 
completion time data and a label for the random sample of 12 individuals using the current 
technology, and column B contains the completion time data and a label for the random 
sample of 12 individuals using the new software. 


FIGURE 10.6 | Dialog Box for Excel's t-Test: Two-Sample Assuming Unequal 
Variances Tool 

FIGURE 10.7 | Excel Results for the Hypothesis Test About Equality of Mean 
Project Completion Times 

Apply Tools: The following steps will provide the information needed to conduct 
the hypothesis test to see whether there is a significant difference in favor of the new 
software. 


Step 1. Click the Data tab on the Ribbon 
Step 2. In the Analyze group, click Data Analysis 
Step 3. Choose t-Test: Two-Sample Assuming Unequal Variances from the list of 
Analysis Tools 
Step 4. When the t-Test: Two-Sample Assuming Unequal Variances dialog box appears 
(Figure 10.6): 
Enter A1:A13 in the Variable 1 Range: box 
Enter B1:B13 in the Variable 2 Range: box 
Enter 0 in the Hypothesized Mean Difference: box 
Select the check box for Labels 
Enter .05 in the Alpha: box 
Select Output Range: and enter D1 in the box 
Click OK 


The results are shown in Figure 10.7. Descriptive statistics for the two samples are
shown in cells E4:F6. The value of the test statistic, 2.2721, is shown in cell E9. The
p-value for the test, labeled “P(T<=t) one-tail”, is shown in cell E10. Because the p-value,
.0166, is less than the level of significance α = .05, we can conclude that the mean
completion time for the population using the new software package is smaller.

The t-Test: Two-Sample Assuming Unequal Variances tool can also be used to conduct
two-tailed hypothesis tests. The only change required to make the hypothesis testing
decision is that we need to use the p-value for a two-tailed test, labeled “P(T<=t) two-tail”
(see cell E12).


Practical Advice 


The interval estimation and hypothesis testing procedures presented in this section are
robust and can be used with relatively small sample sizes. In most applications, equal
or nearly equal sample sizes such that the total sample size n₁ + n₂ is at least 20 can
be expected to provide very good results even if the populations are not normal. Larger
sample sizes are recommended if the distributions of the populations are highly skewed or
contain outliers. Smaller sample sizes should only be used if the analyst is satisfied that the
distributions of the populations are at least approximately normal.

Whenever possible, equal sample sizes, n₁ = n₂, are recommended.

NOTE + COMMENT 


Another approach used to make inferences about the difference between two population
means when σ₁ and σ₂ are unknown is based on the assumption that the two population
standard deviations are equal (σ₁ = σ₂ = σ). Under this assumption, the two sample
standard deviations are combined to provide the following pooled sample variance:

\[
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
\]

The t test statistic becomes

\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\]

and has n₁ + n₂ − 2 degrees of freedom. At this point, the computation of the p-value and
the interpretation of the sample results are identical to the procedures discussed earlier in
this section.

A difficulty with this procedure is that the assumption that the two population standard
deviations are equal is usually difficult to verify. Unequal population standard deviations are
frequently encountered. Using the pooled procedure may not provide satisfactory results,
especially if the sample sizes n₁ and n₂ are quite different.

The t procedure that we presented in this section does not require the assumption of equal
population standard deviations and can be applied whether the population standard
deviations are equal or not. It is a more general procedure and is recommended for most
applications.
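
To see how the two approaches compare on numbers already used in this section, the following Python sketch applies both to the summary statistics of the software testing study. It is only an illustration; the function names `pooled_t` and `unpooled_t` are assumptions introduced here.

```python
import math

def pooled_t(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Pooled-variance t statistic and its n1 + n2 - 2 degrees of freedom."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (xbar1 - xbar2 - d0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def unpooled_t(xbar1, s1, n1, xbar2, s2, n2, d0=0.0):
    """Test statistic (10.8) with degrees of freedom from equation (10.7)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (xbar1 - xbar2 - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Software testing study: mean, standard deviation, and size for each sample.
software = (325, 40, 12, 286, 44, 12)
t_pooled, df_pooled = pooled_t(*software)
t_unpooled, df_unpooled = unpooled_t(*software)
print(f"pooled:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.3f}, df = {df_unpooled:.1f}")
```

With equal sample sizes the two test statistics coincide and only the degrees of freedom differ slightly; the procedures can diverge more noticeably when n₁ and n₂ are quite different and the sample standard deviations are unequal.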


EXERCISES 


Methods 
9. The following results are for independent random samples taken from two populations. 


Sample 1        Sample 2
n₁ = 20         n₂ = 30
x̄₁ = 22.5       x̄₂ = 20.1
s₁ = 2.5        s₂ = 4.8

a. What is the point estimate of the difference between the two population means?
b. What is the degrees of freedom for the t distribution?
c. At 95% confidence, what is the margin of error?
d. What is the 95% confidence interval for the difference between the two population
means?


10. Consider the following hypothesis test. 
H₀: μ₁ − μ₂ = 0
Hₐ: μ₁ − μ₂ ≠ 0

The following results are from independent samples taken from two populations.

Sample 1        Sample 2
n₁ = 35         n₂ = 40
x̄₁ = 13.6       x̄₂ = 10.1
s₁ = 5.2        s₂ = 8.5

a. What is the value of the test statistic?
b. What is the degrees of freedom for the t distribution?
c. What is the p-value?
d. At α = .05, what is your conclusion?

11. Consider the following data for two independent random samples taken from two
normal populations.

Sample 1    10    7    13    7    9    8
Sample 2     8    7     8    4    6    9

a. Compute the two sample means.
b. Compute the two sample standard deviations.
c. What is the point estimate of the difference between the two population means?
d. What is the 90% confidence interval estimate of the difference between the two
population means?


Applications 
12. Miles Driven Per Day. The U.S. Department of Transportation provides the number 
of miles that residents of the 75 largest metropolitan areas travel per day in a car. 
Suppose that for a random sample of 50 Buffalo residents the mean is 22.5 miles a 
day and the standard deviation is 8.4 miles a day, and for an independent random 
sample of 40 Boston residents the mean is 18.6 miles a day and the standard devia- 
tion is 7.4 miles a day. 
a. What is the point estimate of the difference between the mean number of miles that 
Buffalo residents travel per day and the mean number of miles that Boston residents 
travel per day? 
b. What is the 95% confidence interval for the difference between the two population 
means? 
13. Annual Cost of College. The increasing annual cost (including tuition, room, board, 
books, and fees) to attend college has been widely discussed in many publications 
including Money magazine. The following random samples show the annual cost of 
attending private and public colleges. Data are in thousands of dollars. 




Private Colleges 


52.8 43.2 45.0 3973 44.0 
30.6 45.8 37.8 50.5 42.0 


Public Colleges 


20.3 22.0 28.2 156 24.1 28.5 
22.8 25.8 18.5 20 14.4 21.8 


a. Compute the sample mean and sample standard deviation for private and public 
colleges. 

b. What is the point estimate of the difference between the two population means? 
Interpret this value in terms of the annual cost of attending private and public 
colleges. 

c. Develop a 95% confidence interval of the difference between the mean annual cost 
of attending private and public colleges. 

14. Salaries of Recent College Graduates. The Tippie College of Business obtained the 
following results on the salaries of a recent graduating class: 


Finance Majors      Business Analytics Majors
n₁ = 110            n₂ = 30
x̄₁ = $48,537        x̄₂ = $55,317
s₁ = $18,000        s₂ = $10,000


a. Formulate a hypothesis so that, if the null hypothesis is rejected, we can conclude
that salaries for Finance majors are significantly lower than the salaries of Business
Analytics majors. Use α = .05.

b. What is the value of the test statistic? 

c. What is the p-value? 

d. What is your conclusion? 

15. Hotel Prices. Hotel room pricing changes over time, but is there a difference between
Europe hotel prices and U.S. hotel prices? The file IntHotels contains changes in the hotel
prices for 47 major European cities and 53 major U.S. cities.

a. On the basis of the sample results, can we conclude that the mean change in hotel
rates in Europe and the United States are different? Develop appropriate null and
alternative hypotheses.

b. Use α = .01. What is your conclusion?


16. Effect of Parents’ Education on Student SAT Scores. The College Board provided
comparisons of Scholastic Aptitude Test (SAT) scores based on the highest level of
education attained by the test taker’s parents. A research hypothesis was that students
whose parents had attained a higher level of education would on average score higher
on the SAT. The overall mean SAT math score was 514. SAT math scores for independent
samples of students follow. The first sample shows the SAT math test scores
for students whose parents are college graduates with a bachelor’s degree. The second
sample shows the SAT math test scores for students whose parents are high school
graduates but do not have a college degree.


                Student's Parents

College Grads           High School Grads
485    487              442    492
534    533              580    478
650    526              479    425
554    410              486    485
550    515              528    390
572    578              524    595
497    448
592    469


a. Formulate the hypotheses that can be used to determine whether the sample data 
support the hypothesis that students show a higher population mean math score on 
the SAT if their parents attained a higher level of education. 

b. What is the point estimate of the difference between the means for the two 
populations? 

c. Compute the p-value for the hypothesis test. 

d. At α = .05, what is your conclusion? 

17. Comparing Financial Consultant Ratings. Periodically, Merrill Lynch customers are
asked to evaluate Merrill Lynch financial consultants and services. Higher ratings on
the client satisfaction survey indicate better service, with 7 the maximum service rating.
Independent samples of service ratings for two financial consultants are summarized
here. Consultant A has 10 years of experience, whereas consultant B has 1 year of
experience. Use α = .05 and test to see whether the consultant with more experience has the
higher population mean service rating.

Consultant A    Consultant B
n₁ = 16         n₂ = 10
x̄₁ = 6.82       x̄₂ = 6.25
s₁ = .64        s₂ = .75

a. State the null and alternative hypotheses.
b. Compute the value of the test statistic.
c. What is the p-value?
d. What is your conclusion?

18. Comparing Length of Flight Delays. The success of an airline depends heavily on its
ability to provide a pleasant customer experience. One dimension of customer service
on which airlines compete is on-time arrival. The file LateFlights contains a sample of
data from delayed flights showing the number of minutes each delayed flight was late
for two different airlines, Delta and Southwest.

a. Formulate the hypotheses that can be used to test for a difference between the
population mean minutes late for delayed flights by these two airlines.

b. What is the sample mean number of minutes late for delayed flights for each of
these two airlines?

c. Using a .05 level of significance, what is the p-value and what is your conclusion?


10.3 Inferences About the Difference Between 
Two Population Means: Matched Samples 


Suppose employees at a manufacturing company can use two different methods to 
perform a production task. To maximize production output, the company wants to 
identify the method with the smaller population mean completion time. Let μ₁ denote
the population mean completion time for production method 1 and μ₂ denote the
population mean completion time for production method 2. With no preliminary indication
of the preferred production method, we begin by tentatively assuming that the two
production methods have the same population mean completion time. Thus, the null
hypothesis is H₀: μ₁ − μ₂ = 0. If this hypothesis is rejected, we can conclude that the
population mean completion times differ. In this case, the method providing the smaller
mean completion time would be recommended. The null and alternative hypotheses are
written as follows.


H₀: μ₁ − μ₂ = 0
Hₐ: μ₁ − μ₂ ≠ 0


In choosing the sampling procedure that will be used to collect production time data and 
test the hypotheses, we consider two alternative designs. One is based on independent 
samples and the other is based on matched samples. 


1. Independent sample design: A random sample of workers is selected and each 
worker in the sample uses method 1. A second independent random sample of 
workers is selected and each worker in this sample uses method 2. The test of the 
difference between population means is based on the procedures in Section 10.2. 

2. Matched sample design: One random sample of workers is selected. Each worker 
first uses one method and then uses the other method. The order of the two methods 
is assigned randomly to the workers, with some workers performing method 1 first 
and others performing method 2 first. Each worker provides a pair of data values, 
one value for method 1 and another value for method 2. 


In the matched sample design, the two production methods are tested under similar con- 
ditions (i.e., with the same workers); hence this design often leads to a smaller sampling 
error than the independent sample design. The primary reason is that in a matched sample 
design, variation between workers is eliminated because the same workers are used for 
both production methods. 

Let us demonstrate the analysis of a matched sample design by assuming it is the 
method used to test the difference between population means for the two production methods. 
A random sample of six workers is used. The data on completion times for the six workers 
are given in Table 10.2. Note that each worker provides a pair of data values, one for each
production method. Also note that the last column contains the difference in completion
times dᵢ for each worker in the sample.


TABLE 10.2 Task Completion Times for a Matched Sample Design

          Completion Time    Completion Time    Difference in
          for Method 1       for Method 2       Completion
Worker    (minutes)          (minutes)          Times (dᵢ)
1         6.0                5.4                 .6
2         5.0                5.2                −.2
3         7.0                6.5                 .5
4         6.2                5.9                 .3
5         6.0                6.0                 .0
6         6.4                5.8                 .6


The key to the analysis of the matched sample design is to realize that we consider
only the column of differences. Therefore, we have six data values (.6, −.2, .5, .3, .0,
and .6) that will be used to analyze the difference between population means of the two
production methods.

Let μ_d = the mean of the difference in values for the population of workers. With this
notation, the null and alternative hypotheses are rewritten as follows.


H₀: μ_d = 0
Hₐ: μ_d ≠ 0


If H₀ is rejected, we can conclude that the population mean completion times differ.

The d notation is a reminder that the matched sample provides difference data. The
sample mean and sample standard deviation for the six difference values in Table 10.2
follow.

\[
\bar{d} = \frac{\sum d_i}{n} = \frac{1.8}{6} = .30
\]

\[
s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{.56}{5}} = .335
\]

Other than the use of the d notation, the formulas for the sample mean and sample
standard deviation are the same ones used previously in the text.

With the small sample of n = 6 workers, we need to make the assumption that the
population of differences has a normal distribution. This assumption is necessary so that we
may use the t distribution for hypothesis testing and interval estimation procedures.
Based on this assumption, the following test statistic has a t distribution with n − 1
degrees of freedom.

It is not necessary to make the assumption that the population has a normal distribution
if the sample size is large. Sample size guidelines for using the t distribution were
presented in Chapters 8 and 9.


TEST STATISTIC FOR HYPOTHESIS TESTS INVOLVING MATCHED SAMPLES

\[
t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \tag{10.9}
\]


Once the difference data are computed, the t distribution procedure for matched samples
is the same as the one-population estimation and hypothesis testing procedures described
in Chapters 8 and 9.

Let us use equation (10.9) to test the hypotheses H₀: μ_d = 0 and Hₐ: μ_d ≠ 0, using α = .05.
Substituting the sample results d̄ = .30, s_d = .335, and n = 6 into equation (10.9), we
compute the value of the test statistic.

\[
t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{.30 - 0}{.335 / \sqrt{6}} = 2.20
\]
Now let us compute the p-value for this two-tailed test. Because t = 2.20 > 0, the test
statistic is in the upper tail of the t distribution. With t = 2.20, the area in the upper tail
to the right of the test statistic can be found by using the t distribution table with degrees
of freedom = n − 1 = 6 − 1 = 5. Information from the 5 degrees of freedom row of the
t distribution table is as follows:


Area in Upper Tail    .20      .10      .05      .025     .01      .005
t-Value (5 df)        0.920    1.476    2.015    2.571    3.365    4.032

(t = 2.20 lies between 2.015 and 2.571)


Thus, we see that the area in the upper tail is between .05 and .025. Because this test is a
two-tailed test, we double these values to conclude that the p-value is between .10 and .05.
This p-value is greater than $\alpha = .05$. Thus, the null hypothesis $H_0\colon \mu_d = 0$ is not rejected.
Using Excel and the data in Table 10.2, we find the exact p-value = .0795.
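For readers who want to check the exact p-value without Excel, the following sketch integrates the t density numerically with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical and written only for this illustration; a statistics library would normally supply the t distribution directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3   # the density is symmetric about 0

d = [0.6, -0.2, 0.5, 0.3, 0.0, 0.6]   # differences from Table 10.2
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
t_stat = d_bar / (s_d / math.sqrt(n))

# Two-tailed test: double the upper-tail area beyond |t|
p_two_tail = 2 * t_upper_tail(abs(t_stat), n - 1)
print(f"t = {t_stat:.4f}, two-tailed p-value = {p_two_tail:.4f}")
```

The result agrees with the table-based bracket (between .05 and .10) and with the Excel value quoted above.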
In addition we can obtain an interval estimate of the difference between the two population
means by using the single population methodology. (Chapter 8 discusses the construction
of an interval estimate for a single population mean.) At 95% confidence, the calculation
follows.

$$\bar{d} \pm t_{.025}\frac{s_d}{\sqrt{n}}$$

$$.3 \pm 2.571\left(\frac{.335}{\sqrt{6}}\right)$$

$$.3 \pm .35$$

Thus, the margin of error is .35 and the 95% confidence interval for the difference between
the population means of the two production methods is −.05 minutes to .65 minutes.
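The interval can be checked with a short Python sketch. It assumes the critical value 2.571 is taken from the 5 degrees of freedom row of the t table shown earlier (upper-tail area .025) rather than computed; the variable names are illustrative only.

```python
import math

d = [0.6, -0.2, 0.5, 0.3, 0.0, 0.6]   # differences from Table 10.2
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))

t_025 = 2.571                          # t table value, 5 df, upper-tail area .025
margin = t_025 * s_d / math.sqrt(n)    # margin of error at 95% confidence
lower, upper = d_bar - margin, d_bar + margin

print(f"margin = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

The interval contains 0, which is consistent with the two-tailed test above not rejecting the null hypothesis at the .05 level.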


Using Excel to Conduct a Hypothesis Test 


Excel’s t-Test: Paired Two Sample for Means tool can be used to conduct a hypothesis test 
about the difference between the population means when a matched sample design is used. 
We illustrate by conducting the hypothesis test involving the two production methods. 
Refer to the Excel worksheets shown in Figure 10.8 and Figure 10.9 as we describe the 
tasks involved. 


Enter/Access Data: Open the file Matched. Column A in Figure 10.8 is used to identify 
each of the six workers who participated in the study. Column B contains the completion 
time data for each worker using method 1, and column C contains the completion time data 
for each worker using method 2. 


Apply Tools: The following steps describe how to use Excel’s t-Test: Paired Two Sample
for Means tool to conduct the hypothesis test about the difference between the means of the
two production methods.


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. Choose t-Test: Paired Two Sample for Means from the list of Analysis
        Tools
Step 4. When the t-Test: Paired Two Sample for Means dialog box appears (Figure 10.8):
            Enter B1:B7 in the Variable 1 Range: box
            Enter C1:C7 in the Variable 2 Range: box
            Enter 0 in the Hypothesized Mean Difference: box
            Select the check box for Labels
            Enter .05 in the Alpha: box
            Select Output Range in the Output options area
            Enter E1 in the Output Range: box
            Click OK


The results are shown in cells E1:G14 of the worksheet shown in Figure 10.9. The
p-value for the test, labeled “P(T<=t) two-tail”, is shown in cell F13. Because the p-value,
.0795, is greater than the level of significance $\alpha = .05$, we cannot reject the null hypothesis
that the mean completion times are equal.

The same procedure can also be used to conduct one-tailed hypothesis tests. The only
change required to make the hypothesis testing decision is that we need to use the p-value
for a one-tailed test, labeled “P(T<=t) one-tail” (see cell F11).


FIGURE 10.8 | Dialog Box for Excel's t-Test: Paired Two Sample for Means Tool

[Figure: worksheet with Worker, Method 1, and Method 2 in columns A to C, and the dialog box
showing Variable 1 Range: $B$1:$B$7, Variable 2 Range: $C$1:$C$7, Hypothesized Mean
Difference: 0, Alpha: 0.05, and Output Range: E1.]


FIGURE 10.9 | Excel Results for the Hypothesis Test in the Matched
Samples Study

[Figure: output of the t-Test: Paired Two Sample for Means tool in cells E1:G14.]
NOTES + COMMENTS


1. In the example presented in this section, workers performed the production task with
   first one method and then the other method. This example illustrates a matched
   sample design in which each sampled element (worker) provides a pair of data values.
   It is also possible to use different but “similar” elements to provide the pair of data
   values. For example, a worker at one location could be matched with a similar worker
   at another location (similarity based on age, education, gender, experience, etc.).
   The pairs of workers would provide the difference data that could be used in the
   matched sample analysis.

2. A matched sample procedure for inferences about two population means generally
   provides better precision than the independent sample approach; therefore it is the
   recommended design. However, in some applications the matching cannot be achieved,
   or perhaps the time and cost associated with matching are excessive. In such cases,
   the independent sample design should be used.


EXERCISES


Methods 

19. Consider the following hypothesis test.

$$H_0\colon \mu_d \le 0$$
$$H_a\colon \mu_d > 0$$

The following data are from matched samples taken from two populations.


               Population
Element        1        2
1              21       20
2              28       26
3              18       18
4              20       20
5              26       24


a. Compute the difference value for each element.
b. Compute $\bar{d}$.
c. Compute the standard deviation $s_d$.
d. Conduct a hypothesis test using $\alpha = .05$. What is your conclusion?
20. The following data are from matched samples taken from two populations.


               Population
Element        1        2
1              11       8
2              7        8
3              9        6
4              12       7
5              13       10
6              15       15
7              15       14


a. Compute the difference value for each element.
b. Compute $\bar{d}$.
c. Compute the standard deviation $s_d$.
d. What is the point estimate of the difference between the two population means?
e. Provide a 95% confidence interval for the difference between the two population
means.


Applications 

21. Television Commercials and Product Purchase Potential. A market research firm 
used a sample of individuals to rate the purchase potential of a particular product 
before and after the individuals saw a new television commercial about the product. 
The purchase potential ratings were based on a 0 to 10 scale, with higher values in- 
dicating a higher purchase potential. The null hypothesis stated that the mean rating 
“after” would be less than or equal to the mean rating “before.” Rejection of this 
hypothesis would show that the commercial improved the mean purchase potential 
rating. Use $\alpha = .05$ and the following data to test the hypothesis and comment on the
value of the commercial.


              Purchase Rating                   Purchase Rating
Individual    After    Before     Individual    After    Before
1             6        5          5             3        5
2             6        4          6             9        8
3             7        7          7             7        5
4             4        3          8             6        6
22. First Quarter Stock Market Performance. The price per share of stock for a sample
of 25 companies was recorded at the beginning of the first financial quarter and then
again at the end of the first financial quarter. How stocks perform during the first
quarter is an indicator of what is ahead for the stock market and the economy. Use the
sample data in the file StockQuarter to answer the following.

a. Let $d_i$ denote the percentage change in price per share for company i where

$$d_i = \frac{\text{end of 1st quarter price per share} - \text{beginning of 1st quarter price per share}}{\text{beginning of 1st quarter price per share}}$$

Use the sample mean of these values to estimate the percentage change in the stock
price over the first quarter.

b. What is the 95% confidence interval estimate of the population mean percentage 
change in the price per share of stock during the first quarter? Interpret this result. 

23. Credit Card Expenditures. Bank of America’s Consumer Spending Survey collected 
data on annual credit card charges in seven different categories of expenditures: transporta- 
tion, groceries, dining out, household expenses, home furnishings, apparel, and entertain- 
ment. Using data from a sample of 42 credit card accounts, assume that each account was 
used to identify the annual credit card charges for groceries (population 1) and the annual 
credit card charges for dining out (population 2). Using the difference data, the sample 
mean difference was $\bar{d} = \$850$, and the sample standard deviation was $s_d = \$1123$.

a. Formulate the null and alternative hypotheses to test for no difference between the 
population mean credit card charges for groceries and the population mean credit 
card charges for dining out. 

b. Use a .05 level of significance. Can you conclude that the population means dif- 
fer? What is the p-value? 

c. Which category, groceries or dining out, has a higher population mean annual credit 
card charge? What is the point estimate of the difference between the population 
means? What is the 95% confidence interval estimate of the difference between the 
population means? 

24. Domestic Airfare. The Global Business Travel Association reported the domestic
airfare for business travel for the current year and the previous year. Below is a sample
of 12 flights with their domestic airfares shown for both years.

DATA file: BusinessTravel


Current Year    Previous Year    Current Year    Previous Year
345             315              635             585
526             463              710             650
420             462              605             545
216             206              517             547
285             275              570             508
405             432              610             580

a. Formulate the hypotheses and test for a significant increase in the mean domestic 
airfare for business travel for the one-year period. What is the p-value? Using a .05 
level of significance, what is your conclusion? 

b. What is the sample mean domestic airfare for business travel for each year? 

c. What is the percentage change in the airfare for the one-year period? 

25. SAT Scores. The College Board SAT college entrance exam consists of two sections: 
math and evidence-based reading and writing (EBRW). Sample data showing the math 
and EBRW scores for a sample of 12 students who took the SAT follow. 


DATA file: TestScores

Student    Math    EBRW    Student    Math    EBRW
1          540     474     7          480     430
2          432     380     8          499     459
3          528     463     9          610     615
4          574     612     10         572     541
5          448     420     11         390     335
6          502     526     12         593     613

a. Use a .05 level of significance and test for a difference between the population 
mean for the math scores and the population mean for the EBRW scores. What is 
the p-value and what is your conclusion? 
b. What is the point estimate of the difference between the mean scores for the two 
tests? What are the estimates of the population mean scores for the two tests? 
Which test reports the higher mean score? 
26. PGA Tour Scores. Scores in the first and fourth (final) rounds for a sample of 
20 golfers who competed in PGA tournaments are shown in the following table. 
Suppose you would like to determine whether the mean score for the first round of a 
PGA Tour event is significantly different than the mean score for the fourth and final 
round. Does the pressure of playing in the final round cause scores to go up? Or does 
the increased player concentration cause scores to come down? 
DATA file: GolfScores

                  First    Final                        First    Final
Player            Round    Round    Player              Round    Round
Michael Letzig    70       72       Aron Price          72       72
Scott Verplank    71       72       Charles Howell      72       70
D. A. Points      70       75       Jason Dufner        70       73
Jerry Kelly       72       71       Mike Weir           70       77
Soren Hansen      70       69       Carl Pettersson     68       70
D. J. Trahan      67       67       Bo Van Pelt         68       65
Bubba Watson      71       67       Ernie Els           71       70
Retief Goosen     68       75       Cameron Beckman     70       68
Jeff Klauk        67       73       Nick Watney         69       68
Kenny Perry       70       69       Tommy Armour III    67       71

a. Use $\alpha = .10$ to test for a statistically significant difference between the population
means for first- and fourth-round scores. What is the p-value? What is your
conclusion?

b. What is the point estimate of the difference between the two population means? For 
which round is the population mean score lower? 

c. What is the margin of error for a 90% confidence interval estimate for the differ- 
ence between the population means? Could this confidence interval have been used 
to test the hypothesis in part (a)? Explain. 

27. Price Comparison of Smoothie Blenders. A personal fitness company produces 
both a deluxe and a standard model of a smoothie blender for home use. Selling prices 
obtained from a sample of retail outlets follow. 


                 Model Price ($)                       Model Price ($)
Retail Outlet    Deluxe    Standard    Retail Outlet    Deluxe    Standard
1                39        27          5                40        30
2                39        28          6                39        34
3                45        35          7                35        29
4                38        30


a. The manufacturer’s suggested retail prices for the two models show a $10 price dif- 
ferential. Use a .05 level of significance and test that the mean difference between 
the prices of the two models is $10. 

b. What is the 95% confidence interval for the difference between the mean prices of 
the two models? 


10.4 Inferences About the Difference Between 
Two Population Proportions 


Letting $p_1$ denote the proportion for population 1 and $p_2$ denote the proportion for
population 2, we next consider inferences about the difference between the two population
proportions: $p_1 - p_2$. To make an inference about this difference, we will select two independent
random samples consisting of $n_1$ units from population 1 and $n_2$ units from population 2.


Interval Estimation of $p_1 - p_2$


In the following example, we show how to compute a margin of error and develop an inter- 
val estimate of the difference between two population proportions. 

A tax preparation firm is interested in comparing the quality of work at two of its 
regional offices. By randomly selecting samples of tax returns prepared at each office and 
verifying the sample returns’ accuracy, the firm will be able to estimate the proportion of 
erroneous returns prepared at each office. Of particular interest is the difference between 
these proportions. 


$p_1$ = proportion of erroneous returns for population 1 (office 1)
$p_2$ = proportion of erroneous returns for population 2 (office 2)
$\bar{p}_1$ = sample proportion for a random sample from population 1
$\bar{p}_2$ = sample proportion for a random sample from population 2


The difference between the two population proportions is given by $p_1 - p_2$. The point
estimator of $p_1 - p_2$ is as follows.


POINT ESTIMATOR OF THE DIFFERENCE BETWEEN TWO POPULATION PROPORTIONS

$$\bar{p}_1 - \bar{p}_2 \tag{10.10}$$
Thus, the point estimator of the difference between two population proportions is the
difference between the sample proportions of two independent random samples.

As with other point estimators, the point estimator $\bar{p}_1 - \bar{p}_2$ has a sampling distribution
that reflects the possible values of $\bar{p}_1 - \bar{p}_2$ if we repeatedly took two independent random
samples. The mean of this sampling distribution is $p_1 - p_2$ and the standard error of $\bar{p}_1 - \bar{p}_2$
is as follows:


STANDARD ERROR OF $\bar{p}_1 - \bar{p}_2$

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \tag{10.11}$$


If the sample sizes are large enough that $n_1 p_1$, $n_1(1 - p_1)$, $n_2 p_2$, and $n_2(1 - p_2)$ are all
greater than or equal to 5, the sampling distribution of $\bar{p}_1 - \bar{p}_2$ can be approximated by a
normal distribution. (Sample sizes involving proportions are usually large enough to use
this approximation.)
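The large-sample condition is easy to check mechanically. The sketch below uses a hypothetical helper, `normal_approx_ok`, and made-up sample sizes and proportions purely for illustration; in practice the population proportions are unknown, so sample proportions would typically be substituted when applying the check.

```python
def normal_approx_ok(n1, p1, n2, p2, threshold=5):
    """Return True when n1*p1, n1*(1-p1), n2*p2, and n2*(1-p2) are all >= threshold."""
    checks = [n1 * p1, n1 * (1 - p1), n2 * p2, n2 * (1 - p2)]
    return all(value >= threshold for value in checks)

# Hypothetical inputs, not taken from the text
print(normal_approx_ok(100, 0.30, 120, 0.25))   # every product is at least 30
print(normal_approx_ok(20, 0.10, 120, 0.25))    # 20 * 0.10 = 2, which is below 5
```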

As we showed previously, an interval estimate is given by a point estimate ± a margin
of error. In the estimation of the difference between two population proportions, an interval
estimate will take the following form:

$$\bar{p}_1 - \bar{p}_2 \pm \text{Margin of error}$$

With the sampling distribution of $\bar{p}_1 - \bar{p}_2$ approximated by a normal distribution, we
would like to use $z_{\alpha/2}\,\sigma_{\bar{p}_1 - \bar{p}_2}$ as the margin of error. However, $\sigma_{\bar{p}_1 - \bar{p}_2}$ given by equation
(10.11) cannot be used directly because the two population proportions, $p_1$ and $p_2$, are
unknown. Using the sample proportion $\bar{p}_1$ to estimate $p_1$ and the sample proportion $\bar{p}_2$ to
estimate $p_2$, the margin of error is as follows.

$$\text{Margin of error} = z_{\alpha/2}\sqrt{\frac{\bar{p}_1(1 - \bar{p}_1)}{n_1} + \frac{\bar{p}_2(1 - \bar{p}_2)}{n_2}} \tag{10.12}$$

The general form of an interval estimate of the difference between two population
proportions is as follows.


INTERVAL ESTIMATE OF THE DIFFERENCE BETWEEN TWO POPULATION PROPORTIONS

$$\bar{p}_1 - \bar{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\bar{p}_1(1 - \bar{p}_1)}{n_1} + \frac{\bar{p}_2(1 - \bar{p}_2)}{n_2}} \tag{10.13}$$

where $1 - \alpha$ is the confidence coefficient.


Returning to the tax preparation example, we find that independent random samples
from the two offices provide the following information.


Office 1                                Office 2
$n_1 = 250$                             $n_2 = 300$
Number of returns with errors = 35      Number of returns with errors = 27


The sample proportions for the two offices follow.

$$\bar{p}_1 = \frac{35}{250} = .14$$

$$\bar{p}_2 = \frac{27}{300} = .09$$




The point estimate of the difference between the proportions of erroneous tax returns for
the two populations is $\bar{p}_1 - \bar{p}_2 = .14 - .09 = .05$. Thus, we estimate that office 1 has a
.05, or 5%, greater error rate than office 2.

Expression (10.13) can now be used to provide a margin of error and interval estimate
of the difference between the two population proportions. Using a 90% confidence interval
with $z_{\alpha/2} = z_{.05} = 1.645$, we have

$$\bar{p}_1 - \bar{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\bar{p}_1(1 - \bar{p}_1)}{n_1} + \frac{\bar{p}_2(1 - \bar{p}_2)}{n_2}}$$

$$.14 - .09 \pm 1.645\sqrt{\frac{.14(1 - .14)}{250} + \frac{.09(1 - .09)}{300}}$$

$$.05 \pm .045$$

Thus, the margin of error is .045, and the 90% confidence interval is .005 to .095.
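The same interval can be computed in Python from the summarized sample results. This is a sketch under the assumption that only the counts and sample sizes are available; it uses `statistics.NormalDist` from the standard library for the z value, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Tax preparation example: sample sizes and number of erroneous returns
n1, errors1 = 250, 35    # office 1
n2, errors2 = 300, 27    # office 2

p1_bar = errors1 / n1    # sample proportion, office 1
p2_bar = errors2 / n2    # sample proportion, office 2

confidence = 0.90
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)   # z value with upper-tail area alpha/2

# Estimated standard error and margin of error, as in equation (10.12)
se = math.sqrt(p1_bar * (1 - p1_bar) / n1 + p2_bar * (1 - p2_bar) / n2)
margin = z * se

point = p1_bar - p2_bar
lower, upper = point - margin, point + margin

print(f"point = {point:.2f}, margin = {margin:.4f}, "
      f"interval = ({lower:.4f}, {upper:.4f})")
```

Changing `confidence` to another value, such as 0.95, gives the corresponding interval without further edits.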


Using Excel to Construct a Confidence Interval 


We can create a worksheet for developing an interval estimate of the difference between 
population proportions. Let us illustrate by developing an interval estimate of the differ- 
ence between the proportions of erroneous tax returns at the two offices of the tax prepara- 
tion firm. Refer to Figure 10.10 as we describe the tasks involved. The formula worksheet 
is in the background; the value worksheet appears in the foreground. 


Enter/Access Data: Open the file TaxPrep. Columns A and B contain headings and Yes or 
No data that indicate which of the tax returns from each office contain an error. 


FIGURE 10.10 | Constructing a 90% Confidence Interval for the Difference in the
Proportion of Erroneous Tax Returns Prepared by Two Offices

Interval Estimate of Difference in Population Proportions

Row   Label (column D)                Formula in column E                    Office 1 (E)   Office 2 (F)
5     Sample Size                     =COUNTA(A2:A251)                       250            300
6     Response of Interest            Yes                                    Yes            Yes
7     Count for Response              =COUNTIF(A2:A251,E6)                   35             27
8     Sample Proportion               =E7/E5                                 0.14           0.09
10    Confidence Coefficient          0.9                                    0.9
11    Level of Significance           =1-E10                                 0.1
12    z Value                         =NORM.S.INV(1-E11/2)                   1.645
14    Standard Error                  =SQRT(E8*(1-E8)/E5+F8*(1-F8)/F5)       0.0275
15    Margin of Error                 =E12*E14                               0.0452
17    Point Estimate of Difference    =E8-F8                                 0.05
18    Lower Limit                     =E17-E15                               0.0048
19    Upper Limit                     =E17+E15                               0.0952

Note: Rows 20-249 and 252-299 are hidden.
Enter Functions and Formulas: The descriptive statistics needed are provided in
cells E5:F5 and E7:F8. Note that Excel’s COUNTA function is used in cells E5 and
F5 to count the number of observations for each of the samples. The value worksheet
indicates 250 returns in the sample from office 1 and 300 returns in the sample from
office 2. In cells E6 and F6, we type Yes to indicate the response of interest (an erroneous
return). Excel’s COUNTIF function is used in cells E7 and F7 to count the number
of Yes responses from each office. Formulas entered into cells E8 and F8 compute the
sample proportions. Entering .9 into cell E10 for the value of the confidence coefficient,
the corresponding level of significance ($\alpha = .10$) is computed in cell E11. In cell E12
we use the NORM.S.INV function to compute the z value needed to compute the margin
of error for the interval estimate.

(The Excel function COUNTA counts all cells that are not empty, i.e., it counts "anything,"
while the Excel function COUNT only counts cells with numeric values.)


In cell E14, a point estimate of $\sigma_{\bar{p}_1 - \bar{p}_2}$, the standard error of the point estimator $\bar{p}_1 - \bar{p}_2$, is
computed based on the two sample proportions (E8 and F8) and sample sizes (E5 and F5).
The margin of error is then computed in cell E15 by multiplying the z value by the estimate
of the standard error.

The point estimate of the difference in the two population proportions is computed in cell
E17 as the difference in the sample proportions; the result, shown in the value worksheet,
is .05. The lower limit of the confidence interval is computed in cell E18 by subtracting the
margin of error from the point estimate. The upper limit is computed in cell E19 by adding
the margin of error to the point estimate. The value worksheet shows that the 90% confidence
interval estimate of the difference in the two population proportions is .0048 to .0952.


A Template for Other Problems This worksheet can be used as a template for other 
problems requiring an interval estimate of the difference in population proportions. The 
new data must be entered in columns A and B. The data ranges in the cells used to compute 
the sample size (E5:F5) and the cells used to compute a count of the response of interest 
(E7:F7) must be changed to correctly indicate the location of the new data. The response of 
interest must be typed into cells E6:F6. The 90% confidence interval for the new data will 
then appear in cells E17:E19. If an interval estimate with a different confidence coefficient 
is desired, simply change the entry in cell E10. 

This worksheet can also be used as a template for solving text exercises in which the 
sample data have already been summarized. No change in the data section is necessary. 
Simply type the values for the given sample sizes in cells E5:F5 and type the given values 
for the sample proportions in cells E8:F8. The 90% confidence interval will then appear 
in cells E17:E19. If an interval estimate with a different confidence coefficient is desired, 
simply change the entry in cell E10. 


Hypothesis Tests About $p_1 - p_2$


Let us now consider hypothesis tests about the difference between the proportions of two
populations. We focus on tests involving no difference between the two population
proportions. In this case, the three forms for a hypothesis test are as follows:

$$H_0\colon p_1 - p_2 \ge 0 \qquad H_0\colon p_1 - p_2 \le 0 \qquad H_0\colon p_1 - p_2 = 0$$
$$H_a\colon p_1 - p_2 < 0 \qquad H_a\colon p_1 - p_2 > 0 \qquad H_a\colon p_1 - p_2 \neq 0$$

(All hypotheses considered use 0 as the difference of interest.)

When we assume $H_0$ is true as an equality, we have $p_1 - p_2 = 0$, which is the same as
saying that the population proportions are equal, $p_1 = p_2$.

We will base the test statistic on the sampling distribution of the point estimator $\bar{p}_1 - \bar{p}_2$.
In equation (10.11), we showed that the standard error of $\bar{p}_1 - \bar{p}_2$ is given by

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
Under the assumption $H_0$ is true as an equality, the population proportions are equal and
$p_1 = p_2 = p$. In this case, $\sigma_{\bar{p}_1 - \bar{p}_2}$ becomes


STANDARD ERROR OF $\bar{p}_1 - \bar{p}_2$ WHEN $p_1 = p_2 = p$

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}} = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{10.14}$$


With p unknown, we pool, or combine, the point estimators from the two samples ($\bar{p}_1$ and $\bar{p}_2$)
to obtain a single point estimator of p as follows:


POOLED ESTIMATOR OF p WHEN $p_1 = p_2 = p$

$$\bar{p} = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2} \tag{10.15}$$


This pooled estimator of p is a weighted average of $\bar{p}_1$ and $\bar{p}_2$.
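To see why the pooled estimator is a weighted average, the fraction in equation (10.15) can be split into two terms. The short derivation below is added for illustration; the symbols $x_1$ and $x_2$, denoting the number of responses of interest in each sample, are notation introduced here and are not used elsewhere in the text.

```latex
\bar{p}
  = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2}
  = \underbrace{\frac{n_1}{n_1 + n_2}}_{w_1}\,\bar{p}_1
  + \underbrace{\frac{n_2}{n_1 + n_2}}_{w_2}\,\bar{p}_2 ,
\qquad w_1 + w_2 = 1 .

% Because n_1 \bar{p}_1 = x_1 and n_2 \bar{p}_2 = x_2, the pooled estimator is also
% the overall proportion of responses of interest in the combined sample:
\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}
```

The larger sample receives the larger weight, and when $n_1 = n_2$ the pooled estimator reduces to the simple average of the two sample proportions.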

Substituting $\bar{p}$ for p in equation (10.14), we obtain an estimate of the standard error of
$\bar{p}_1 - \bar{p}_2$. This estimate of the standard error is used in the test statistic. The general form of
the test statistic for hypothesis tests about the difference between two population
proportions is the point estimator divided by the estimate of $\sigma_{\bar{p}_1 - \bar{p}_2}$.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT $p_1 - p_2$

$$z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \tag{10.16}$$


This test statistic applies to large sample situations where $n_1 p_1$, $n_1(1 - p_1)$, $n_2 p_2$, and
$n_2(1 - p_2)$ are all greater than or equal to 5.

Let us return to the tax preparation firm example and assume that the firm wants to use a
hypothesis test to determine whether the error proportions differ between the two offices. A
two-tailed test is required. The null and alternative hypotheses are as follows:

$$H_0\colon p_1 - p_2 = 0$$
$$H_a\colon p_1 - p_2 \neq 0$$

If $H_0$ is rejected, the firm can conclude that the error rates at the two offices differ. We will
use $\alpha = .10$ as the level of significance.
The sample data previously collected showed $\bar{p}_1 = .14$ for the $n_1 = 250$ returns sampled
at office 1 and $\bar{p}_2 = .09$ for the $n_2 = 300$ returns sampled at office 2. We continue by
computing the pooled estimate of p.

$$\bar{p} = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2} = \frac{250(.14) + 300(.09)}{250 + 300} = .1127$$

Using this pooled estimate and the difference between the sample proportions, the value of
the test statistic is as follows.

$$z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{.14 - .09}{\sqrt{.1127(1 - .1127)\left(\frac{1}{250} + \frac{1}{300}\right)}} = 1.85$$
In computing the p-value for this two-tailed test, we first note that z = 1.85 is in the
upper tail of the standard normal distribution. Using z = 1.85 and the standard normal
distribution table, we find the area in the upper tail is 1.0000 − .9678 = .0322. Doubling this
area for a two-tailed test, we find the p-value = 2(.0322) = .0644. With the p-value less
than $\alpha = .10$, $H_0$ is rejected at the .10 level of significance. The firm can conclude that the
error rates differ between the two offices. This hypothesis testing conclusion is consistent
with the earlier interval estimation results that showed the interval estimate of the
difference between the population error rates at the two offices to be .005 to .095, with office
1 having the higher error rate.
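The pooled test can also be sketched in Python using the summarized sample results. This is an illustrative stand-in for the table lookup, using `statistics.NormalDist` from the standard library; because z is not rounded to 1.85 before the lookup, the p-value differs slightly from the table-based .0644.

```python
import math
from statistics import NormalDist

# Tax preparation example: sample sizes and sample proportions
n1, p1_bar = 250, 0.14   # office 1
n2, p2_bar = 300, 0.09   # office 2

# Pooled estimator of p, equation (10.15)
p_pooled = (n1 * p1_bar + n2 * p2_bar) / (n1 + n2)

# Estimated standard error under H0: p1 = p2, from equation (10.14)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# Test statistic, equation (10.16)
z = (p1_bar - p2_bar) / se

p_lower = NormalDist().cdf(z)            # lower-tail area
p_upper = 1 - p_lower                    # upper-tail area
p_two_tail = 2 * min(p_lower, p_upper)   # two-tailed p-value

alpha = 0.10
reject_h0 = p_two_tail <= alpha

print(f"pooled p = {p_pooled:.4f}, z = {z:.4f}, "
      f"two-tailed p-value = {p_two_tail:.4f}, reject H0: {reject_h0}")
```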


Using Excel to Conduct a Hypothesis Test 


We can create a worksheet for conducting a hypothesis test about the difference between 
population proportions. Let us illustrate by testing to see whether there is a significant 
difference between the proportions of erroneous tax returns at the two offices of the tax 
preparation firm. Refer to Figure 10.11 as we describe the tasks involved. The formula 
worksheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file TaxPrep. Columns A and B contain headings and Yes or 
No data that indicate which of the tax returns from each office contain an error. 


Enter Functions and Formulas: The descriptive statistics needed to perform the hypo- 
thesis test are provided in cells E5:F5 and E7:F8. They are the same as the ones used for an 
interval estimate (see Figure 10.10). The hypothesized value of the difference between the 
two populations is zero; it is entered into cell E10. In cell E11, the difference in the sample 
proportions is used to compute a point estimate of the difference in the two population 
proportions. Using the two sample proportions and sample sizes, a pooled estimate of the 
population proportion p is computed in cell E13; its value is .1127. Then, in cell E14, an
estimate of $\sigma_{\bar{p}_1 - \bar{p}_2}$ is computed using equation (10.14), with the pooled estimate of p and
the sample sizes.


FIGURE 10.11 | Hypothesis Test Concerning Difference in Proportion of Erroneous Tax
Returns Prepared by Two Offices

Hypothesis Test Concerning Difference Between Population Proportions

                          Office 1    Office 2
Sample Size               250         300
Response of Interest      Yes         Yes
Count for Response        35          27
Sample Proportion         0.14        0.09

Row   Label (column D)                Formula in column E                    Value
10    Hypothesized Value              0                                      0
11    Point Estimate of Difference    =E8-F8                                 0.05
13    Pooled Estimate of p            =(E5*E8+F5*F8)/(E5+F5)                 0.1127
14    Standard Error                  =SQRT(E13*(1-E13)*(1/E5+1/F5))         0.0271
15    Test Statistic                  =(E11-E10)/E14                         1.8462
17    p-value (Lower Tail)            =NORM.S.DIST(E15,TRUE)                 0.9676
18    p-value (Upper Tail)            =1-NORM.S.DIST(E15,TRUE)               0.0324
19    p-value (Two Tail)              =2*MIN(E17,E18)                        0.0649

Note: Rows 20-249 and 252-299 are hidden.
The formula =(E11-E10)/E14 entered into cell E15 computes the test statistic z (1.8462).
The NORM.S.DIST function is then used to compute the p-value (Lower Tail) and the
p-value (Upper Tail) in cells E17 and E18. The p-value (Two Tail) is computed in cell E19 as
twice the minimum of the two one-tailed p-values. The value worksheet shows that p-value
(Two Tail) = .0649. Because the p-value = .0649 is less than the level of significance,
$\alpha = .10$, we have sufficient evidence to reject the null hypothesis and conclude that the
population proportions are not equal.

(The p-value here (.0649) differs from the one we found using the cumulative normal
probability tables (.0644) due to rounding.)

This worksheet can be used as a template for hypothesis testing problems involving 
differences between population proportions. The new data can be entered into columns 
A and B. The ranges for the new data and the response of interest need to be revised in 
cells E5:F7. The remainder of the worksheet will then be updated as needed to conduct the 
hypothesis test. If a hypothesized difference other than 0 is to be used, the new value must 
be entered in cell E10. 

To use this worksheet for exercises in which the sample statistics are given, just type in 
the given values for cells E5:F5 and E7:F8. The remainder of the worksheet will then be 
updated to conduct the hypothesis test. If a hypothesized difference other than 0 is to be 
used, the new value must be entered in cell E10. 


EXERCISES 


Methods 
28. Consider the following results for independent samples taken from two populations. 


Sample 1                Sample 2
$n_1 = 400$             $n_2 = 300$
$\bar{p}_1 = .48$       $\bar{p}_2 = .36$


a. What is the point estimate of the difference between the two population proportions? 
b. Develop a 90% confidence interval for the difference between the two population 
proportions. 
c. Develop a 95% confidence interval for the difference between the two population 
proportions. 
29. Consider the hypothesis test 


$$H_0\colon p_1 - p_2 \le 0$$
$$H_a\colon p_1 - p_2 > 0$$


The following results are for independent samples taken from the two populations.


Sample 1                Sample 2
$n_1 = 200$             $n_2 = 300$
$\bar{p}_1 = .22$       $\bar{p}_2 = .16$


a. What is the p-value?
b. With $\alpha = .05$, what is your hypothesis testing conclusion?


Applications 

30. Corporate Hiring Outlook. A BusinessWeek/Harris poll asked senior executives 
at large corporations their opinions about the economic outlook for the future. One 
question was, “Do you think that there will be an increase in the number of full-time 
employees at your company over the next 12 months?” In the current survey, 220 of 
400 executives answered Yes, while in a previous year survey, 192 of 400 executives 
had answered Yes. Provide a 95% confidence interval estimate for the difference
between the proportions at the two points in time. What is your interpretation of the 
interval estimate? 

31. Impact of Pinterest on Purchase Decisions. Forbes reports that women trust recom- 
mendations from Pinterest more than recommendations from any other social network 
platform (Forbes.com). But does trust in Pinterest differ by gender? The following 
sample data show the number of women and men who stated in a recent sample that 
they trust recommendations made on Pinterest. 


Women Men 
Sample 150 170 
Trust Recommendations Made on Pinterest Vy 102 


a. What is the point estimate of the proportion of women who trust recommendations 
made on Pinterest? 

b. What is the point estimate of the proportion of men who trust recommendations 
made on Pinterest? 

c. Provide a 95% confidence interval estimate of the difference between the propor- 
tion of women and men who trust recommendations made on Pinterest. 

32. Mislabeling of Fish. Researchers with Oceana, a group dedicated to preserving the 
ocean ecosystem, reported finding that 33% of fish sold in retail outlets, grocery stores, 
and sushi bars throughout the United States had been mislabeled (San Francisco 
Chronicle). Does this mislabeling differ for different species of fish? The following 
data show the number labeled incorrectly for samples of tuna and mahi mahi. 


Tuna Mahi Mahi 
Sample 220 160 
Mislabeled 99 56 


a. What is the point estimate of the proportion of tuna that is mislabeled? 

b. What is the point estimate of the proportion of mahi mahi that is mislabeled? 

c. Provide a 95% confidence interval estimate of the difference between the propor- 
tions of tuna and mahi mahi that is mislabeled. 
33. Voter Turnout. Minnesota had the highest turnout rate of any state for the 2016 pres- 
idential election (United States Election Project website). Political analysts wonder if 
turnout in rural Minnesota was higher than turnout in the urban areas of the state. A 
sample shows that 663 of 884 registered voters from rural Minnesota voted in the 2016 
presidential election, while 414 out of 575 registered voters from urban Minnesota voted. 
a. Formulate the null and alternative hypotheses that can be used to test whether reg- 
istered voters in rural Minnesota were more likely than registered voters in urban 
Minnesota to vote in the 2016 presidential election. 

b. What is the proportion of sampled registered voters in rural Minnesota that voted in 
the 2016 presidential election? 

c. What is the proportion of sampled registered voters in urban Minnesota that voted 
in the 2016 presidential election? 

d. At $\alpha = .05$, test the political analysts’ hypothesis. What is the p-value, and what 
conclusion do you draw from your results? 

34. Oil Well Drilling. Oil wells are expensive to drill, and dry wells are a great concern to 
oil exploration companies. The domestic oil and natural gas producer Aegis Oil, LLC 
describes on its website how improvements in technologies such as three-dimensional 
seismic imaging have dramatically reduced the number of dry (nonproducing) wells it 
and other oil exploration companies drill. The following sample data for wells drilled 
in 2012 and 2018 show the number of dry wells that were drilled in each year. 


2012 2018 
Wells Drilled 119 162 
Dry Wells 24 18 


a. Formulate the null and alternative hypotheses that can be used to test whether the 
wells drilled in 2012 were more likely to be dry than wells drilled in 2018. 

b. What is the point estimate of the proportion of wells drilled in 2012 that were dry? 

c. What is the point estimate of the proportion of wells drilled in 2018 that were dry? 

d. What is the p-value of your hypothesis test? At $\alpha = .05$, what conclusion do you 
draw from your results? 

35. Hotel Occupancy Rates. Tourism is extremely important to the economy of Florida. 
Hotel occupancy is an often-reported measure of visitor volume and visitor activity 
(Orlando Sentinel, May 19, 2018). Hotel occupancy data for February in two consecu- 
tive years are as follows. 


Current Year Previous Year 
Occupied Rooms 1470 1458 
Total Rooms 1750 1800 


a. Formulate the hypothesis test that can be used to determine whether there has been 
an increase in the proportion of rooms occupied over the one-year period. 

b. What is the estimated proportion of hotel rooms occupied each year? 

c. Using a .05 level of significance, what is your hypothesis test conclusion? What is 
the p-value? 

d. What is the 95% confidence interval estimate of the change in occupancy for the 
one-year period? Do you think area officials would be pleased with the results? 

36. Gender Differences in Raise or Promotion Expectations. The Adecco Workplace 
Insights Survey sampled men and women workers and asked if they expected 
to get a raise or promotion this year (USA Today). Suppose the survey sampled 200 
men and 200 women. If 104 of the men replied Yes and 74 of the women replied 
Yes, are the results statistically significant in that you can conclude a greater 
proportion of men are expecting to get a raise or a promotion this year? 

a. State the hypothesis test in terms of the population proportion of men and the 
population proportion of women. 

b. What is the sample proportion for men? For women? 

c. Use a .01 level of significance. What is the p-value and what is your conclusion? 

37. Default Rates on Bank Loans. Carl Allen and Norm Nixon are two loan officers 
at Brea Federal Savings and Loan Bank. The bank manager is interested in 
comparing the default rate on the loans approved by Carl to the default rate on 
the loans approved by Norm. In the sample of loans collected, there are 60 loans 
approved by Carl (9 of which defaulted) and 80 loans approved by Norm (7 of 
which defaulted). 

a. State the hypothesis test that the default rates are the same for the two loan 
officers. 

b. What is the sample default proportion for Carl? For Norm? 

c. Use a .05 level of significance. What is the p-value and what is your conclusion? 
SUMMARY 


In this chapter we discussed procedures for developing interval estimates and conducting 
hypothesis tests involving two populations. First, we showed how to make inferences 
about the difference between two population means when independent random samples 
are selected. We first considered the case where the population standard deviations $\sigma_1$ and 
$\sigma_2$ could be assumed known. The standard normal distribution z was used to develop the 
interval estimate and served as the test statistic for hypothesis tests. We then considered the 
case where the population standard deviations were unknown and estimated by the sample 
standard deviations $s_1$ and $s_2$. In this case, the t distribution was used to develop the interval 
estimate and the t value served as the test statistic for hypothesis tests. 
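
As a rough illustration of the unknown-standard-deviation case just summarized, the following Python sketch computes the t test statistic and the degrees of freedom from summary statistics, following equations (10.8) and (10.7) in the Key Formulas. The function name `welch_t` and the sample figures are hypothetical and are not taken from the text.

```python
import math


def welch_t(x1_bar, s1, n1, x2_bar, s2, n2, d0=0.0):
    """Return (t, df) for a test about mu1 - mu2 when sigma1 and sigma2 are unknown.

    Hypothetical helper; mirrors equations (10.8) and (10.7).
    """
    v1 = s1 ** 2 / n1  # s1^2 / n1
    v2 = s2 ** 2 / n2  # s2^2 / n2
    standard_error = math.sqrt(v1 + v2)
    t = ((x1_bar - x2_bar) - d0) / standard_error
    # Degrees of freedom; the result is generally not an integer.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up summary statistics for two independent samples.
t, df = welch_t(x1_bar=10.0, s1=2.0, n1=16, x2_bar=8.0, s2=3.0, n2=9)
print(f"t = {t:.4f}, df = {df:.2f}")
```

A noninteger value of df can be rounded down before a t table is consulted, which gives a slightly more conservative result.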

Inferences about the difference between two population means were then discussed for 
the matched sample design. In the matched sample design each element provides a pair of 
data values, one from each population. The difference between the paired data values is 
then used in the statistical analysis. The matched sample design is generally preferred to 
the independent sample design because the matched sample procedure often improves the 
precision of the estimate. 
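
The matched sample calculation can be sketched in a few lines of Python. The example below forms the paired differences, then computes their mean, their sample standard deviation, and the t statistic of equation (10.9). The function name and the paired data are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics


def matched_sample_t(sample1, sample2, mu_d=0.0):
    """Return (d_bar, s_d, t) for paired data; hypothetical helper for equation (10.9)."""
    differences = [a - b for a, b in zip(sample1, sample2)]
    d_bar = statistics.mean(differences)   # mean of the paired differences
    s_d = statistics.stdev(differences)    # sample standard deviation (n - 1 divisor)
    t = (d_bar - mu_d) / (s_d / math.sqrt(len(differences)))
    return d_bar, s_d, t


# Made-up paired observations: each position is one element measured twice.
method_1 = [5, 7, 6, 9]
method_2 = [4, 6, 6, 7]
d_bar, s_d, t = matched_sample_t(method_1, method_2)
print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, t = {t:.3f}")
```

The statistic is referred to a t distribution with n - 1 degrees of freedom, where n is the number of pairs.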

Finally, interval estimation and hypothesis testing about the difference between two 
population proportions were discussed. Statistical procedures for analyzing the difference 
between two population proportions are similar to the procedures for analyzing the differ- 
ence between two population means. 

We showed how Excel can be used to develop the interval estimates and conduct the 
hypothesis tests discussed in the chapter. Many of the worksheets can be used as templates 
to solve problems encountered in practice as well as the exercises and case problems found 
in the text. 


GLOSSARY 


Independent random samples Samples selected from two populations in such a way that 
the elements making up one sample are chosen independently of the elements making up 
the other sample. 

Matched samples Samples in which each data value of one sample is matched with a 
corresponding data value of the other sample. 

Pooled estimator of p An estimator of a population proportion obtained by computing a 
weighted average of the point estimators obtained from two independent samples. 


KEY FORMULAS 




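Point Estimator of the Difference Between Two Population Means 

$$\bar{x}_1 - \bar{x}_2 \tag{10.1}$$

Standard Error of $\bar{x}_1 - \bar{x}_2$ 

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \tag{10.2}$$

Interval Estimate of the Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Known 

$$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \tag{10.4}$$

Test Statistic for Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Known 

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \tag{10.5}$$

Interval Estimate of the Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ Unknown 

$$\bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{10.6}$$

Degrees of Freedom: t Distribution with Two Independent Random Samples 

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2} \tag{10.7}$$

Test Statistic for Hypothesis Tests About $\mu_1 - \mu_2$: $\sigma_1$ and $\sigma_2$ Unknown 

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{10.8}$$

Test Statistic for Hypothesis Tests Involving Matched Samples 

$$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} \tag{10.9}$$

Point Estimator of the Difference Between Two Population Proportions 

$$\bar{p}_1 - \bar{p}_2 \tag{10.10}$$

Standard Error of $\bar{p}_1 - \bar{p}_2$ 

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \tag{10.11}$$

Interval Estimate of the Difference Between Two Population Proportions 

$$\bar{p}_1 - \bar{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\bar{p}_1(1 - \bar{p}_1)}{n_1} + \frac{\bar{p}_2(1 - \bar{p}_2)}{n_2}} \tag{10.13}$$

Standard Error of $\bar{p}_1 - \bar{p}_2$ When $p_1 = p_2 = p$ 

$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{10.14}$$

Pooled Estimator of p When $p_1 = p_2 = p$ 

$$\bar{p} = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2} \tag{10.15}$$

Test Statistic for Hypothesis Tests About $p_1 - p_2$ 

$$z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{10.16}$$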
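
To show how the proportion formulas fit together, here is a minimal Python sketch that applies equations (10.13), (10.15), and (10.16) to counts from two independent samples. It relies on `statistics.NormalDist` from the Python standard library for normal probabilities. The function name, the dictionary keys, and the counts are hypothetical and are chosen only for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_inference(x1, n1, x2, n2, alpha=0.05):
    """Interval estimate and z test for p1 - p2 (hypothetical helper)."""
    p1_bar, p2_bar = x1 / n1, x2 / n2
    diff = p1_bar - p2_bar                      # point estimator, equation (10.10)

    # Interval estimate, equation (10.13): unpooled standard error.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1_bar * (1 - p1_bar) / n1 + p2_bar * (1 - p2_bar) / n2)
    interval = (diff - z_crit * se, diff + z_crit * se)

    # Test of H0: p1 - p2 = 0, equations (10.15) and (10.16): pooled estimator.
    p_pooled = (n1 * p1_bar + n2 * p2_bar) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

    return {"diff": diff, "interval": interval, "z": z,
            "p_value_two_tailed": p_value_two_tailed}


# Made-up counts: 60 of 200 successes in sample 1, 40 of 200 in sample 2.
result = two_proportion_inference(60, 200, 40, 200)
print(result)
```

For an upper tail test the p-value would be the area to the right of z rather than the two-tailed area computed here.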
SUPPLEMENTARY EXERCISES 


38. Supermarket Checkout Lane Design. Safegate Foods, Inc., is redesigning the check- 
out lanes in its supermarkets throughout the country and is considering two designs. 
Tests on customer checkout times conducted at two stores where the two new systems 
have been installed result in the following summary of the data. 


System A                        System B 
$n_1 = 120$                     $n_2 = 100$ 
$\bar{x}_1 = 4.1$ minutes       $\bar{x}_2 = 3.4$ minutes 
$\sigma_1 = 2.2$ minutes        $\sigma_2 = 1.5$ minutes 


Test at the .05 level of significance to determine whether the population mean check- 
out times of the two systems differ. Which system is preferred? 
39. SUV Lease Payments. Statista reports that the average monthly lease payment 
for an automobile is falling in the United States (Statista.com), but does this apply 
to all classes of automobiles? Suppose you are interested in whether this trend is 
true for sport utility vehicles (SUVs). The file SUVLease contains monthly lease 
payment data for 33 randomly selected SUVs in 2015 and 46 randomly selected 
SUVs in 2016. 

a. Provide and interpret a point estimate of the difference between the population 
mean monthly lease payments for the two years. 

b. Develop a 99% confidence interval estimate of the difference between the mean 
monthly lease payments in 2015 and 2016. 

c. Would you feel justified in concluding that monthly lease payments have declined 
from 2015 to 2016? Why or why not? 

40. Load versus No-Load Mutual Funds. Mutual funds are classified as load or no- 
load funds. Load funds require an investor to pay an initial fee based on a percent- 
age of the amount invested in the fund. The no-load funds do not require this initial 
fee. Some financial advisors argue that the load mutual funds may be worth the 
extra fee because these funds provide a higher mean rate of return than the no- 
load mutual funds. A sample of 30 load mutual funds and a sample of 30 no-load 
mutual funds were selected. Data in the file Mutual were collected on the annual 
return for the funds over a five-year period. The data for the first five load and first 
five no-load mutual funds are as follows. 



Mutual Funds—Load Return Mutual Funds—No-Load Return 
American National Growth 15S) Amana Income Fund 13.24 
Arch Small Cap Equity 14.57 Berger One Hundred Zs 
Bartlett Cap Basic TS Columbia International Stock WAM 
Calvert World International 10.31 Dodge & Cox Balanced 16.06 
Colonial Fund A 16.23 Evergreen Fund 17.61 


a. Formulate $H_0$ and $H_a$ such that rejection of $H_0$ leads to the conclusion that the load 
mutual funds have a higher mean annual return over the five-year period. 
b. Conduct the hypothesis test. What is the p-value? At $\alpha = .05$, what is your conclusion? 
41. Kitchen Versus Bedroom Remodeling Costs. The National Association of Home 
Builders provided data on the cost of the most popular home remodeling projects. 
Sample data on cost in thousands of dollars for two types of remodeling projects are 
as follows. 

Kitchen Master Bedroom Kitchen Master Bedroom 
2572 18.0 230 17S 
17.4 22) 197 24.6 
22.8 26.4 16.9 21.0 
se) 24.8 218 
197 26.9 23.6 


a. Develop a point estimate of the difference between the population mean remodeling 
costs for the two types of projects. 

b. Develop a 90% confidence interval for the difference between the two population 
means. 

42. Effect of Siblings on SAT Scores. In Born Together—Reared Apart: The Landmark 
Minnesota Twin Study, Nancy Segal discusses the efforts of research psychologists at the 
University of Minnesota to understand similarities and differences between twins by study- 
ing sets of twins who were raised separately. Below are evidence-based reading and writing 
(EBRW) SAT scores for several pairs of identical twins (twins who share all of their genes) 
and were raised separately, one of whom was raised in a family with no other children 
(no siblings) and one of whom was raised in a family with other children (with siblings). 


DATA file: Twins 

No Siblings                        With Siblings 
Name          EBRW SAT Score       Name          EBRW SAT Score 
Bob           440                  Donald        420 
Matthew       610                  Ronald        540 
Shannon       590                  Kedriana      630 
Tyler         390                  Kevin         430 
Michelle      410                  Erin          460 
Darius        430                  Michael       490 
Wilhelmina    510                  Josephine     460 
Donna         620                  Jasmine       540 
Drew          510                  Kraig         460 
Lucinda       680                  Bernadette    650 
Barry         580                  Larry         450 
Julie         610                  Jennifer      640 
Hannah        510                  Diedra        460 
Roger         630                  Latishia      580 
Garrett       570                  Bart          490 
Roger         630                  Kara          640 
Nancy         530                  Rachel        560 
Sam           590                  Joey          610 
Simon         500                  Drew          520 
Megan         610                  Annie         640 


a. What is the mean difference between the EBRW SAT scores for the twins raised 
with no siblings and the twins raised with siblings? 

b. Provide a 90% confidence interval estimate of the mean difference between the 
EBRW SAT scores for the twins raised with no siblings and the twins raised with 
siblings. 
c. Conduct a hypothesis test of equality of the critical reading SAT scores for the 
twins raised with no siblings and the twins raised with siblings at $\alpha = .01$. What is 
your conclusion? 

43. Change in Financial Security. Country Financial, a financial services company, uses 
surveys of adults age 18 and older to determine whether personal financial fitness is 
changing over time. A recent sample of 1000 adults showed 410 indicating that their 
financial security was more than fair. Just a year prior, a sample of 900 adults showed 
315 indicating that their financial security was more than fair. 

a. State the hypotheses that can be used to test for a significant difference between the 
population proportions for the two years. 

b. Conduct the hypothesis test and compute the p-value. At a .05 level of significance, 
what is your conclusion? 

c. What is the 95% confidence interval estimate of the difference between the two 
population proportions? What is your conclusion? 

44. Differences in Insurance Claims Based on Marital Status. A large automobile insur- 
ance company selected samples of single and married male policyholders and recorded 
the number who made an insurance claim over the preceding three-year period. 


Single Policyholders Married Policyholders 
$n_1 = 400$ $n_2 = 900$ 
Number making claims = 76 Number making claims = 90 


a. Use $\alpha = .05$. Test to determine whether the claim rates differ between single and 
married male policyholders. 

b. Provide a 95% confidence interval for the difference between the proportions for 
the two populations. 

45. Drug-Resistant Gonorrhea. Each year, over 2 million people in the United States 
become infected with bacteria that are resistant to antibiotics. In particular, the Centers 
for Disease Control and Prevention has launched studies of drug-resistant gonorrhea 
(CDC.gov website). Of 142 cases tested in Alabama, 9 were found to be drug-resistant. 
Of 268 cases tested in Texas, 5 were found to be drug-resistant. Do these data suggest 
a statistically significant difference between the proportions of drug-resistant cases in 
the two states? Use a .02 level of significance. What is the p-value, and what is your 
conclusion? 
46. News Access Via Computer. The American Press Institute reports that almost 70% 
of all American adults use a computer to gain access to news. Suppose you suspect 
that the proportion of American adults under 30 years old who use a computer to 
gain access to news exceeds the proportion of Americans at least 30 years old who 
use a computer to gain access to news. Data in the file ComputerNews represent 
responses to the question “Do you use a computer to gain access to news?” given by 
random samples of American adults under 30 years old and Americans who are at 
least 30 years old. 

a. Estimate the proportion of American adults under 30 years old who use a computer 
to gain access to news and the proportion of Americans at least 30 years old who 
use a computer to gain access to news. 

b. Provide a 95% confidence interval for the difference in these proportions. 

c. On the basis of your findings, does it appear the proportion of American adults 
under 30 years old who use a computer to gain access to news exceeds the propor- 
tion of Americans who are at least 30 years old that use a computer to gain access 
to news? 

47. For the week ended January 15, 2009, the bullish sentiment of individual investors 
was 27.6% (AAII Journal). The bullish sentiment was reported to be 48.7% one 
week earlier and 39.7% one month earlier. The sentiment measures are based on a 
poll conducted by the American Association of Individual Investors. Assume that each 
of the bullish sentiment measures was based on a sample size of 240. 

a. Develop a 95% confidence interval for the difference between the bullish sentiment 
measures for the most recent two weeks. 

b. Develop hypotheses so that rejection of the null hypothesis will allow us to con- 
clude that the most recent bullish sentiment is weaker than that of one month ago. 

c. Conduct a hypothesis test for part (b) using $\alpha = .01$. What is your conclusion? 


CASE PROBLEM: PAR, INC. 


Par, Inc., is a major manufacturer of golf equipment. Management believes that Par’s 
market share could be increased with the introduction of a cut-resistant, longer-lasting golf 
ball. Therefore, the research group at Par has been investigating a new golf ball coating 
designed to resist cuts and provide a more durable ball. The tests with the coating have 
been promising. 

One of the researchers voiced concern about the effect of the new coating on driving 
distances. Par would like the new cut-resistant ball to offer driving distances comparable 
to those of the current-model golf ball. To compare the driving distances for the two balls, 
40 balls of both the new and current models were subjected to distance tests. The testing 
was performed with a mechanical hitting machine so that any difference between the mean 
distances for the two models could be attributed to a difference in the two models. The 
results of the tests, with distances measured to the nearest yard, follow. These data are 
available in the file Golf. 


Model Model Model Model 

Current New Current New Current New Current New 

264 277 270 272 263 274 281 283 

261 269 287 259 264 266 274 250 

267 263 289 264 284 22 273 253 

272 266 280 280 263 271 263 260 

258 262 272 274 260 260 275 270 

283 251 275 281 283 281 267 263 

258 262 265 276 255 250 279 261 

266 289 260 269 272 263 274 255 

259 286 278 268 266 278 276 263 

270 264 275 262 268 264 262 279 


Managerial Report 


1. Formulate and present the rationale for a hypothesis test that Par could use to compare 
the driving distances of the current and new golf balls. 

2. Analyze the data to provide the hypothesis testing conclusion. What is the p-value for 
your test? What is your recommendation for Par, Inc.? 

3. Provide descriptive statistical summaries of the data for each model. 

4. What is the 95% confidence interval for the population mean driving distance of each 
model, and what is the 95% confidence interval for the difference between the means of 
the two populations? 

5. Do you see a need for larger sample sizes and more testing with the golf balls? Discuss. 


Chapter 11 


Inferences About Population Variances 


STATISTICS IN PRACTICE 


U.S. Government Accountability Office* 
WASHINGTON, D.C. 


The U.S. Government Accountability Office (GAO) is 
an independent, nonpolitical audit organization in the 
legislative branch of the federal government. GAO 
evaluators determine the effectiveness of current and 
proposed federal programs. To carry out their duties, 
evaluators must be proficient in records review, legisla- 
tive research, and statistical analysis techniques. 

In one case, GAO evaluators studied a Department 
of Interior program established to help clean up the 
nation’s rivers and lakes. As part of this program, federal 
grants were made to small cities throughout the United 
States. Congress asked the GAO to determine how 
effectively the program was operating. To do so, the 
GAO examined records and visited the sites of several 
waste treatment plants. 

One objective of the GAO audit was to ensure that 
the effluent (treated sewage) at the plants met certain 
standards. Among other things, the audits reviewed 
sample data on the oxygen content, the pH level, 
and the amount of suspended solids in the effluent. A 
requirement of the program was that a variety of tests 
be taken daily at each plant and that the collected 
data be sent periodically to the state engineering 
department. The GAO's investigation of the data 
showed whether various characteristics of the effluent 
were within acceptable limits. 

For example, the mean or average pH level of the 
effluent was examined carefully. In addition, the variance 
in the reported pH levels was reviewed. The following 
hypothesis test was conducted about the variance in 
pH level for the population of effluent. 


$H_0$: $\sigma^2 = \sigma_0^2$ 
$H_a$: $\sigma^2 \ne \sigma_0^2$ 

In this test, $\sigma_0^2$ is the population variance in pH level 
expected at a properly functioning plant. In one particular 
plant, the null hypothesis was rejected. Further analysis 
showed that this plant had a variance in pH level that 
was significantly less than normal. 

[Photo caption: Effluent at this facility must fall within a statistically 
determined pH range. Avatar_023/Shutterstock.com] 

*The authors thank Mr. Art Foreman and Mr. Dale Ledman of the 
U.S. Government Accountability Office for providing this Statistics 
in Practice. 

The auditors visited the plant to examine the mea- 
suring equipment and to discuss their statistical findings 
with the plant manager. The auditors found that the 
measuring equipment was not being used because 
the operator did not know how to work it. Instead, the 
operator had been told by an engineer what level of pH 
was acceptable and had simply recorded similar values 
without actually conducting the test. The unusually low 
variance in this plant's data resulted in rejection of $H_0$. 
The GAO suspected that other plants might have sim- 
ilar problems and recommended an operator training 
program to improve the data collection aspect of the 
pollution control program. 

In this chapter you will learn how to conduct 
statistical inferences about the variances of one and 
two populations. Two new distributions, the chi-square 
distribution and the F distribution, will be introduced 
and used to make interval estimates and hypothesis 
tests about population variances. 


In this chapter we examine methods of statistical inference involving population variances. 
As an example of a case in which a variance can provide important decision-making 
information, consider the production process of filling containers with a liquid detergent 
product. The filling mechanism for the process is adjusted so that the mean filling weight 
is 16 ounces per container. Although a mean of 16 ounces is desired, the variance of the 
filling weights is also critical. That is, even with the filling mechanism properly adjusted 
for the mean of 16 ounces, we cannot expect every container to have exactly 16 ounces. 
By selecting a sample of containers, we can compute a sample variance for the number 
of ounces placed in a container. This value will serve as an estimate of the variance for 
the population of containers being filled by the production process. If the sample variance 
is modest, the production process will be continued. However, if the sample variance is 
excessive, overfilling and underfilling may be occurring even though the mean is correct at 
16 ounces. In this case, the filling mechanism will be readjusted in an attempt to reduce the 
filling variance for the containers. 

In many manufacturing applications, controlling the process variance is extremely 
important in maintaining quality. 

The chi-square distribution is based on sampling from a normal population. 

In the first section we consider inferences about the variance of a single population. 
Subsequently, we will discuss procedures that can be used to make inferences about the 
variances of two populations. 


11.1 Inferences About a Population Variance 
The sample variance 

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \tag{11.1}$$

is the point estimator of the population variance $\sigma^2$. In using the sample variance as a basis 
for making inferences about a population variance, the sampling distribution of the quantity 
$(n - 1)s^2/\sigma^2$ is helpful. This sampling distribution is described as follows. 
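
Equation (11.1) can be computed directly. The short Python sketch below does so for a small set of made-up filling weights (in ounces) and compares the result with `statistics.variance` from the Python standard library, which uses the same n - 1 divisor. The function name and the data are hypothetical.

```python
import statistics


def sample_variance(values):
    """Sample variance s^2 as in equation (11.1): squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Made-up filling weights in ounces, for illustration only.
weights = [16.02, 15.98, 16.05, 15.95, 16.00]
s2 = sample_variance(weights)
print(f"s^2 = {s2:.5f}")
print(f"statistics.variance = {statistics.variance(weights):.5f}")
```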


SAMPLING DISTRIBUTION OF $(n - 1)s^2/\sigma^2$ 

Whenever a random sample of size n is selected from a normal population, the 
sampling distribution of 

$$\frac{(n - 1)s^2}{\sigma^2} \tag{11.2}$$

has a chi-square distribution with n - 1 degrees of freedom. 


Figure 11.1 shows some possible forms of the sampling distribution of $(n - 1)s^2/\sigma^2$. 

Because the sampling distribution of $(n - 1)s^2/\sigma^2$ is known to have a chi-square distribution 
whenever a random sample of size n is selected from a normal population, we can use 
the chi-square distribution to develop interval estimates and conduct hypothesis tests about 
a population variance. 
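
The sampling distribution described above can be checked informally by simulation. The sketch below repeatedly draws samples of size 20 from a normal population, computes $(n - 1)s^2/\sigma^2$ for each sample, and compares the mean and variance of the simulated values with those of a chi-square distribution with 19 degrees of freedom (mean equal to the degrees of freedom and variance equal to twice the degrees of freedom, a standard property not derived in this section). The population parameters, the number of repetitions, and the seed are arbitrary assumptions.

```python
import random
import statistics


def simulate_scaled_variance(n, mu, sigma, reps, seed=0):
    """Simulate (n - 1) * s^2 / sigma^2 for repeated normal samples of size n."""
    rng = random.Random(seed)
    simulated = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)        # sample variance, n - 1 divisor
        simulated.append((n - 1) * s2 / sigma ** 2)
    return simulated


# Assumed population: mean 16 ounces, standard deviation .05 ounces.
values = simulate_scaled_variance(n=20, mu=16.0, sigma=0.05, reps=5000)
print("simulated mean:    ", round(statistics.mean(values), 2), "(chi-square mean is 19)")
print("simulated variance:", round(statistics.variance(values), 2), "(chi-square variance is 38)")
```

Because the check rests on random draws, the simulated figures will be close to 19 and 38 but not exactly equal to them.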


Interval Estimation 


To show how the chi-square distribution can be used to develop a confidence interval 
estimate of a population variance $\sigma^2$, suppose that we are interested in estimating the 
population variance for the production filling process mentioned at the beginning of this 
chapter. A sample of 20 containers is taken, and the sample variance for the filling quantities 
is found to be $s^2 = .0025$. However, we know we cannot expect the variance of a 
sample of 20 containers to provide the exact value of the variance for the population of 
containers filled by the production process. Hence, our interest will be in developing an 
interval estimate for the population variance. 

FIGURE 11.1 Examples of the Sampling Distribution of $(n - 1)s^2/\sigma^2$ (a Chi-Square Distribution) 

[Figure: chi-square density curves with 2, 5, and 10 degrees of freedom.] 

We will use the notation $\chi^2_\alpha$ to denote the value for the chi-square distribution that 
provides an area or probability of $\alpha$ to the right of the $\chi^2_\alpha$ value. For example, in 
Figure 11.2 the chi-square distribution with 19 degrees of freedom is shown with 
$\chi^2_{.025} = 32.852$ indicating that 2.5% of the chi-square values are to the right of 32.852, 
and $\chi^2_{.975} = 8.907$ indicating that 97.5% of the chi-square values are to the right of 
8.907. Tables of areas or probabilities are readily available for the chi-square distribution. 
Refer to Table 11.1 and verify that these chi-square values with 19 degrees of 
freedom (19th row of the table) are correct. Table 3 of Appendix B (available online) 
provides a more extensive table of chi-square values. 
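
The two tabled values quoted above can also be checked numerically. The Python sketch below computes the area to the right of a value under a chi-square curve with integer degrees of freedom, using standard closed-form series for the chi-square upper tail (one for even and one for odd degrees of freedom). This is not the method used in the text, which relies on tables and Excel; the function name is hypothetical.

```python
import math


def chi_square_upper_tail(x, df):
    """Area to the right of x under a chi-square curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: 2*Q(chi) + 2*Z(chi) * sum_{r=1}^{(df-1)/2} chi^(2r-1) / (1*3*...*(2r-1)),
    # where chi = sqrt(x), Q is the standard normal upper tail and Z its density.
    chi = math.sqrt(x)
    normal_tail = 0.5 * math.erfc(chi / math.sqrt(2))
    normal_density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term = chi
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return 2 * normal_tail + 2 * normal_density * total


# Check the two values quoted for 19 degrees of freedom.
print(round(chi_square_upper_tail(32.852, 19), 4))  # area to the right of 32.852
print(round(chi_square_upper_tail(8.907, 19), 4))   # area to the right of 8.907
```

The printed areas should agree with .025 and .975 up to the rounding of the tabled values.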

From the graph in Figure 11.2 we see that .95, or 95%, of the chi-square values 
are between $\chi^2_{.975}$ and $\chi^2_{.025}$. That is, there is a .95 probability of obtaining a $\chi^2$ value 
such that 

$$\chi^2_{.975} \le \chi^2 \le \chi^2_{.025}$$

FIGURE 11.2 A Chi-Square Distribution with 19 Degrees of Freedom 

[Figure: .95 of the possible $\chi^2$ values lie between $\chi^2_{.975} = 8.907$ and $\chi^2_{.025} = 32.852$.] 
TABLE 11.1 Selected Values from the Chi-Square Distribution Table* 

Entries are values of $\chi^2_\alpha$, where $\alpha$ is the area or probability in the upper tail. 

Degrees                              Area in Upper Tail 
of Freedom      .99      .975       .95       .90       .10       .05      .025       .01 

   1           .000      .001      .004      .016     2.706     3.841     5.024     6.635 
   2           .020      .051      .103      .211     4.605     5.991     7.378     9.210 
   3           .115      .216      .352      .584     6.251     7.815     9.348    11.345 
   4           .297      .484      .711     1.064     7.779     9.488    11.143    13.277 

   5           .554      .831     1.145     1.610     9.236    11.070    12.832    15.086 
   6           .872     1.237     1.635     2.204    10.645    12.592    14.449    16.812 
   7          1.239     1.690     2.167     2.833    12.017    14.067    16.013    18.475 
   8          1.647     2.180     2.733     3.490    13.362    15.507    17.535    20.090 
   9          2.088     2.700     3.325     4.168    14.684    16.919    19.023    21.666 

  10          2.558     3.247     3.940     4.865    15.987    18.307    20.483    23.209 
  11          3.053     3.816     4.575     5.578    17.275    19.675    21.920    24.725 
  12          3.571     4.404     5.226     6.304    18.549    21.026    23.337    26.217 
  13          4.107     5.009     5.892     7.041    19.812    22.362    24.736    27.688 
  14          4.660     5.629     6.571     7.790    21.064    23.685    26.119    29.141 

  15          5.229     6.262     7.261     8.547    22.307    24.996    27.488    30.578 
  16          5.812     6.908     7.962     9.312    23.542    26.296    28.845    32.000 
  17          6.408     7.564     8.672    10.085    24.769    27.587    30.191    33.409 
  18          7.015     8.231     9.390    10.865    25.989    28.869    31.526    34.805 
  19          7.633     8.907    10.117    11.651    27.204    30.144    32.852    36.191 

  20          8.260     9.591    10.851    12.443    28.412    31.410    34.170    37.566 
  21          8.897    10.283    11.591    13.240    29.615    32.671    35.479    38.932 
  22          9.542    10.982    12.338    14.041    30.813    33.924    36.781    40.289 
  23         10.196    11.689    13.091    14.848    32.007    35.172    38.076    41.638 
  24         10.856    12.401    13.848    15.659    33.196    36.415    39.364    42.980 

  25         11.524    13.120    14.611    16.473    34.382    37.652    40.646    44.314 
  26         12.198    13.844    15.379    17.292    35.563    38.885    41.923    45.642 
  27         12.878    14.573    16.151    18.114    36.741    40.113    43.195    46.963 
  28         13.565    15.308    16.928    18.939    37.916    41.337    44.461    48.278 
  29         14.256    16.047    17.708    19.768    39.087    42.557    45.722    49.588 

  30         14.953    16.791    18.493    20.599    40.256    43.773    46.979    50.892 
  40         22.164    24.433    26.509    29.051    51.805    55.758    59.342    63.691 
  60         37.485    40.482    43.188    46.459    74.397    79.082    83.298    88.379 
  80         53.540    57.153    60.391    64.278    96.578   101.879   106.629   112.329 
 100         70.065    74.222    77.929    82.358   118.498   124.342   129.561   135.807 

*Note: A more extensive table is provided as Table 3 of Appendix B (available online). 
We stated in expression (11.2) that $(n-1)s^2/\sigma^2$ follows a chi-square distribution; therefore
we can substitute $(n-1)s^2/\sigma^2$ for $\chi^2$ and write


$$\chi^2_{.975} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{.025} \qquad (11.3)$$


In effect, expression (11.3) provides an interval estimate in that .95, or 95%, of all possible
values for $(n-1)s^2/\sigma^2$ will be in the interval $\chi^2_{.975}$ to $\chi^2_{.025}$. We now need to do some
algebraic manipulations with expression (11.3) to develop an interval estimate for the
population variance $\sigma^2$. Working with the leftmost inequality in expression (11.3), we have


$$\chi^2_{.975} \le \frac{(n-1)s^2}{\sigma^2}$$

Thus

$$\sigma^2 \chi^2_{.975} \le (n-1)s^2$$

or

$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{.975}} \qquad (11.4)$$


Performing similar algebraic manipulations with the rightmost inequality in expression
(11.3) gives


$$\frac{(n-1)s^2}{\chi^2_{.025}} \le \sigma^2 \qquad (11.5)$$


The results of expressions (11.4) and (11.5) can be combined to provide


$$\frac{(n-1)s^2}{\chi^2_{.025}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{.975}} \qquad (11.6)$$


Because expression (11.3) is true for 95% of the $(n-1)s^2/\sigma^2$ values, expression (11.6)
provides a 95% confidence interval estimate for the population variance $\sigma^2$.

Let us return to the problem of providing an interval estimate for the population
variance of filling quantities. Recall that the sample of 20 containers provided a sample
variance of $s^2 = .0025$. With a sample size of 20, we have 19 degrees of freedom. As
shown in Figure 11.2, we have already determined that $\chi^2_{.975} = 8.907$ and $\chi^2_{.025} = 32.852$.
Using these values in expression (11.6) provides the following interval estimate for the
population variance.


$$\frac{(19)(.0025)}{32.852} \le \sigma^2 \le \frac{(19)(.0025)}{8.907}$$


or

$$.0014 \le \sigma^2 \le .0053$$


Taking the square root of these values provides the following 95% confidence interval for 
the population standard deviation. 


$$.0380 \le \sigma \le .0730$$


Thus, we illustrated the process of using the chi-square distribution to establish interval
estimates of a population variance and a population standard deviation. Note specifically
that because $\chi^2_{.975}$ and $\chi^2_{.025}$ were used, the interval estimate has a .95 confidence coefficient.
Extending expression (11.6) to the general case of any confidence coefficient, we have the
following interval estimate of a population variance.
INTERVAL ESTIMATE OF A POPULATION VARIANCE


$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}} \qquad (11.7)$$


where the $\chi^2$ values are based on a chi-square distribution with n - 1 degrees of freedom
and where $1 - \alpha$ is the confidence coefficient.


Using Excel to Construct a Confidence Interval


Excel can be used to construct a 95% confidence interval of the population variance for the
example involving filling containers with a liquid detergent product. Refer to Figure 11.3
as we describe the tasks involved. The formula worksheet is in the background; the value
worksheet is in the foreground.


Enter/Access Data: Open the file Detergent. Column A shows the number of ounces of
detergent for each of the 20 containers.


Enter Functions and Formulas: The descriptive statistics needed are provided in cells 
D3:D4. Excel’s COUNT and VAR.S functions are used to compute the sample size and the 
sample variance, respectively. 


Cells D6:D9 are used to compute the appropriate chi-square values. The confidence
coefficient was entered in cell D6 and the level of significance ($\alpha$) was computed in cell
D7 by entering the formula =1-D6. Excel’s CHISQ.INV function was used to compute
the lower tail chi-square value. The form of the CHISQ.INV function is CHISQ.INV
(lower tail probability, degrees of freedom). The formula =CHISQ.INV(D7/2,D3-1) was
entered in cell D8 to compute the chi-square value in the lower tail. The value worksheet
shows that the chi-square value for 19 degrees of freedom is $\chi^2_{.975} = 8.9065$. Then, to
compute the chi-square value corresponding to an upper tail probability of .025, the
formula =CHISQ.INV.RT(D7/2,D3-1) was entered in cell D9. The value worksheet shows
that the chi-square value obtained is $\chi^2_{.025} = 32.8523$.


FIGURE 11.3 Excel Worksheet for the Liquid Detergent Filling Process


Interval Estimate of a Population Variance

Row   Label                            Formula in Column D          Value
3     Sample Size                      =COUNT(A2:A21)               20
4     Variance                         =VAR.S(A2:A21)               0.0025
6     Confidence Coefficient           0.95                         0.95
7     Level of Significance (alpha)    =1-D6                        0.05
8     Chi-Square Value (lower tail)    =CHISQ.INV(D7/2,D3-1)        8.9065
9     Chi-Square Value (upper tail)    =CHISQ.INV.RT(D7/2,D3-1)     32.8523
11    Point Estimate                   =D4                          0.0025
12    Lower Limit                      =((D3-1)*D4)/D9              0.0014
13    Upper Limit                      =((D3-1)*D4)/D8              0.0053
Cells D11:D13 provide the point estimate and the lower and upper limits for the confidence
interval. Because the point estimate is just the sample variance, we entered the formula
=D4 in cell D11. Inequality (11.7) shows that the lower limit of the 95% confidence
interval is


$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} = \frac{(n-1)s^2}{\chi^2_{.025}}$$


Thus, to compute the lower limit of the 95% confidence interval, the formula
=((D3-1)*D4)/D9 was entered in cell D12. Inequality (11.7) also shows that the upper
limit of the confidence interval is


$$\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}} = \frac{(n-1)s^2}{\chi^2_{.975}}$$


Thus, to compute the upper limit of the 95% confidence interval, the formula =((D3-1)*D4)/D8
was entered in cell D13. The value worksheet shows a lower limit of .0014 and an upper
limit of .0053. In other words, the 95% confidence interval estimate of the population
variance is from .0014 to .0053.
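
The worksheet calculation in expression (11.7) can also be sketched in a few lines of Python. This is only an illustration: the function name `variance_interval` and its parameter names are hypothetical, and the two chi-square values are passed in as read from Table 11.1 rather than computed.

```python
import math


def variance_interval(n, s2, chi2_upper, chi2_lower):
    """Interval estimate of a population variance, expression (11.7).

    chi2_upper: chi-square value with area alpha/2 in the upper tail.
    chi2_lower: chi-square value with area 1 - alpha/2 in the upper tail.
    Both are assumed to be table values for n - 1 degrees of freedom.
    """
    lower_limit = (n - 1) * s2 / chi2_upper
    upper_limit = (n - 1) * s2 / chi2_lower
    return lower_limit, upper_limit


# Detergent example: n = 20, s^2 = .0025, 19 degrees of freedom
var_lo, var_hi = variance_interval(20, 0.0025, chi2_upper=32.852, chi2_lower=8.907)

# Square roots give the interval for the population standard deviation
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)

print(f"variance: {var_lo:.4f} to {var_hi:.4f}")
print(f"standard deviation: {sd_lo:.4f} to {sd_hi:.4f}")
```

Note that the larger chi-square value produces the lower limit, because it appears in the denominator.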


Hypothesis Testing 
Using $\sigma_0^2$ to denote the hypothesized value for the population variance, the three forms for a
hypothesis test about a population variance are as follows:

$$H_0\colon \sigma^2 \ge \sigma_0^2 \qquad H_0\colon \sigma^2 \le \sigma_0^2 \qquad H_0\colon \sigma^2 = \sigma_0^2$$

$$H_a\colon \sigma^2 < \sigma_0^2 \qquad H_a\colon \sigma^2 > \sigma_0^2 \qquad H_a\colon \sigma^2 \ne \sigma_0^2$$


These three forms are similar to the three forms used to conduct one-tailed and two-tailed
hypothesis tests about population means and proportions. (We discuss hypothesis tests
about population means and proportions in Chapters 9 and 10.)

The procedure for conducting a hypothesis test about a population variance uses the
hypothesized value for the population variance $\sigma_0^2$ and the sample variance $s^2$ to compute
the value of a $\chi^2$ test statistic. Assuming that the population has a normal distribution, the
test statistic is as follows.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION VARIANCE


$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} \qquad (11.8)$$


where $\chi^2$ has a chi-square distribution with n - 1 degrees of freedom.


After computing the value of the $\chi^2$ test statistic, either the p-value approach or the critical
value approach may be used to determine whether the null hypothesis can be rejected.

Let us consider the following example. The St. Louis Metro Bus Company wants to pro- 
mote an image of reliability by encouraging its drivers to maintain consistent schedules. As 
a standard policy the company would like arrival times at bus stops to have low variability. 
In terms of the variance of arrival times, the company standard specifies an arrival time 
variance of 4 or less when arrival times are measured in minutes. The following hypothe- 
sis test is formulated to help the company determine whether the arrival time population 
variance is excessive.


$$H_0\colon \sigma^2 \le 4$$
$$H_a\colon \sigma^2 > 4$$

In tentatively assuming $H_0$ is true, we are assuming that the population variance of
arrival times is within the company guideline. We reject $H_0$ if the sample evidence indicates
that the population variance exceeds the guideline. In this case, follow-up steps should be
taken to reduce the population variance. We conduct the hypothesis test using a level of
significance of $\alpha = .05$.

Suppose that a random sample of 24 bus arrivals taken at a downtown intersection
provides a sample variance of $s^2 = 4.9$ (DATA file: BusTimes). Assuming that the population
distribution of arrival times is approximately normal, the value of the test statistic is as follows:


$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(24-1)(4.9)}{4} = 28.18$$

The chi-square distribution with n - 1 = 24 - 1 = 23 degrees of freedom is shown in
Figure 11.4. Because this is an upper tail test, the area under the curve to the right of the
test statistic $\chi^2 = 28.18$ is the p-value for the test.

Like the t distribution table, the chi-square distribution table does not contain sufficient
detail to enable us to determine the p-value exactly. However, we can use the chi-square
distribution table to obtain a range for the p-value. For example, using Table 11.1, we find
the following information for a chi-square distribution with 23 degrees of freedom.


Area in Upper Tail          .10       .05       .025      .01
$\chi^2$ Value (23 df)      32.007    35.172    38.076    41.638


Because $\chi^2 = 28.18$ is less than 32.007, the area in the upper tail (the p-value) is greater
than .10. With the p-value $> \alpha = .05$, we cannot reject the null hypothesis. The sample does
not support the conclusion that the population variance of the arrival times is excessive.
(Using Excel, p-value = CHISQ.DIST.RT(28.18,23) = .2091.)
As with other hypothesis testing procedures, the critical value approach can also be used
to draw the hypothesis testing conclusion. With $\alpha = .05$, $\chi^2_{.05}$ provides the critical value for
the upper tail hypothesis test. Using Table 11.1 and 23 degrees of freedom, $\chi^2_{.05} = 35.172$.
Thus, the rejection rule for the bus arrival time example is as follows:


Reject $H_0$ if $\chi^2 \ge 35.172$


Because the value of the test statistic is $\chi^2 = 28.18$, we cannot reject the null hypothesis.
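
The table-based reasoning above can be sketched in Python as follows. The helper names `chi_square_statistic` and `upper_tail_p_range` are hypothetical, and the dictionary holds only the four Table 11.1 values for 23 degrees of freedom quoted in the text, so the result is a range for the p-value rather than an exact figure.

```python
def chi_square_statistic(n, s2, sigma0_sq):
    """Test statistic (11.8): (n - 1) s^2 / sigma_0^2."""
    return (n - 1) * s2 / sigma0_sq


def upper_tail_p_range(stat, table_row):
    """Bracket the upper tail p-value using one row of a chi-square table.

    table_row maps an upper tail area to its chi-square value.
    Returns (low, high) such that low < p-value <= high.
    """
    low, high = 0.0, 1.0
    for area, value in table_row.items():
        if stat >= value:
            high = min(high, area)   # statistic is at or beyond this value
        else:
            low = max(low, area)     # statistic falls short of this value
    return low, high


# Table 11.1 values for 23 degrees of freedom, as quoted in the text
row_23_df = {0.10: 32.007, 0.05: 35.172, 0.025: 38.076, 0.01: 41.638}

stat = chi_square_statistic(n=24, s2=4.9, sigma0_sq=4)
p_low, p_high = upper_tail_p_range(stat, row_23_df)
reject = stat >= row_23_df[0.05]     # critical value approach at alpha = .05

print(f"chi-square = {stat:.2f}, p-value between {p_low} and {p_high}, reject H0: {reject}")
```

Both approaches lead to the same decision here: the p-value exceeds .10, and the statistic is below the critical value 35.172.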


FIGURE 11.4 Chi-Square Distribution for the St. Louis Metro Bus Example 


[Figure: chi-square density curve; the p-value is the area in the upper tail to the right of $\chi^2 = 28.18$.]
In practice, upper tail tests as presented here are the most frequently encountered tests
about a population variance. In situations involving arrival times, production times, filling 
weights, part dimensions, and so on, low variances are desirable, whereas large variances 
are unacceptable. With a statement about the maximum allowable population variance, we 
can test the null hypothesis that the population variance is less than or equal to the max- 
imum allowable value against the alternative hypothesis that the population variance is 
greater than the maximum allowable value. With this test structure, corrective action will 
be taken whenever rejection of the null hypothesis indicates the presence of an excessive 
population variance. 

As we saw with population means and proportions, other forms of hypothesis tests can 
be developed. Let us demonstrate a two-tailed test about a population variance by consider- 
ing a situation faced by a bureau of motor vehicles. Historically, the variance in test scores 
for individuals applying for driver’s licenses has been $\sigma^2 = 100$. A new examination with
new test questions has been developed. Administrators of the bureau of motor vehicles 
would like the variance in the test scores for the new examination to remain at the historical 
level. To evaluate the variance in the new examination test scores, the following two-tailed 
hypothesis test has been proposed. 


$$H_0\colon \sigma^2 = 100$$
$$H_a\colon \sigma^2 \ne 100$$


Rejection of $H_0$ will indicate that a change in the variance has occurred and suggest that
some questions in the new examination may need revision to make the variance of the
new test scores similar to the variance of the old test scores. A sample of 30 applicants for
driver’s licenses will be given the new version of the examination. We will use a level of
significance $\alpha = .05$ to conduct the hypothesis test.

The sample of 30 examination scores provided a sample variance $s^2 = 162$. The value of
the chi-square test statistic is as follows:


$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(30-1)(162)}{100} = 46.98$$


Now, let us compute the p-value. Using Table 11.1 and n - 1 = 30 - 1 = 29 degrees of
freedom, we find the following:


Area in Upper Tail          .10       .05       .025      .01
$\chi^2$ Value (29 df)      39.087    42.557    45.722    49.588

The test statistic $\chi^2 = 46.98$ falls between 45.722 and 49.588.

Thus, the value of the test statistic $\chi^2 = 46.98$ provides an area between .025 and .01 in the
upper tail of the chi-square distribution. Doubling these values shows that the two-tailed
p-value is between .05 and .02. Excel can be used to show that $\chi^2 = 46.98$ provides a
p-value = .0374 (using Excel, p-value = 2*CHISQ.DIST.RT(46.98,29) = .0374). With
p-value $\le \alpha = .05$, we reject $H_0$ and conclude that the new examination test scores
have a population variance different from the historical variance of
$\sigma^2 = 100$. A summary of the hypothesis testing procedures for a population variance is
shown in Table 11.2.


TABLE 11.2 Summary of Hypothesis Tests About a Population Variance


                        Lower Tail Test                        Upper Tail Test                        Two-Tailed Test

Hypotheses              $H_0\colon \sigma^2 \ge \sigma_0^2$    $H_0\colon \sigma^2 \le \sigma_0^2$    $H_0\colon \sigma^2 = \sigma_0^2$
                        $H_a\colon \sigma^2 < \sigma_0^2$      $H_a\colon \sigma^2 > \sigma_0^2$      $H_a\colon \sigma^2 \ne \sigma_0^2$

Test Statistic          $\chi^2 = (n-1)s^2/\sigma_0^2$         $\chi^2 = (n-1)s^2/\sigma_0^2$         $\chi^2 = (n-1)s^2/\sigma_0^2$

Rejection Rule:         Reject $H_0$ if                        Reject $H_0$ if                        Reject $H_0$ if
p-Value Approach        p-value $\le \alpha$                   p-value $\le \alpha$                   p-value $\le \alpha$

Rejection Rule:         Reject $H_0$ if                        Reject $H_0$ if                        Reject $H_0$ if
Critical Value          $\chi^2 \le \chi^2_{(1-\alpha)}$       $\chi^2 \ge \chi^2_{\alpha}$           $\chi^2 \le \chi^2_{(1-\alpha/2)}$
Approach                                                                                              or if $\chi^2 \ge \chi^2_{\alpha/2}$


Using Excel to Conduct a Hypothesis Test


In Chapters 9 and 10 we used Excel to conduct a variety of hypothesis tests. The procedure
was general in that, once the test statistic was computed, three p-values were obtained: a
p-value (Lower Tail), a p-value (Upper Tail), and a p-value (Two Tail). Then, depending
on the form of the hypothesis test, the appropriate p-value was used to make the rejection
decision. We will now adapt that approach and use Excel to conduct a hypothesis test about
a population variance. The St. Louis Metro Bus example will serve as an illustration. Refer
to Figure 11.5 as we describe the tasks involved. The formula worksheet is in the background;
the value worksheet is in the foreground.


Enter/Access Data: Open the file BusTimes. The 24 arrival times in number of minutes
past 12:00 noon have been entered into column A.


Enter Functions and Formulas: The descriptive statistics we need are provided in cells
D3:D5. Excel’s COUNT function has been used to compute the sample size and Excel’s
AVERAGE and VAR.S functions have been used to compute the sample mean and sample
variance. The value worksheet shows n = 24, $\bar{x} = 14.76$, and $s^2 = 4.9$.


The hypothesized value of the population variance, $\sigma_0^2 = 4$, was entered in cell D7. The
formula =(D3-1)*D5/D7 was entered in cell D9 to compute the $\chi^2$ test statistic (28.18),
and the formula =D3-1 was entered in cell D10 to compute the degrees of freedom associated
with the test statistic.


FIGURE 11.5 Hypothesis Test for Variance in Bus Arrival Times


Hypothesis Test About a Population Variance

Column A holds the 24 arrival times in rows 2 through 25.

Row   Label                    Formula in Column D           Value
3     Sample Size              =COUNT(A2:A25)                24
4     Sample Mean              =AVERAGE(A2:A25)              14.76
5     Sample Variance          =VAR.S(A2:A25)                4.9
7     Hypothesized Value       4                             4
9     Test Statistic           =(D3-1)*D5/D7                 28.18
10    Degrees of Freedom       =D3-1                         23
12    p-value (Lower Tail)     =CHISQ.DIST(D9,D10,TRUE)      0.7909
13    p-value (Upper Tail)     =CHISQ.DIST.RT(D9,D10)        0.2091
14    p-value (Two Tail)       =2*MIN(D12,D13)               0.4181

Note: Rows 16-24 are hidden.
The lower and upper tail p-values are computed in cells D12 and D13. The CHISQ.DIST
function is used to compute the lower tail p-value in cell D12. To compute the p-value for
a one-tailed hypothesis test in which the rejection region is in the lower tail, we entered
the formula =CHISQ.DIST(D9,D10,TRUE) in cell D12. The CHISQ.DIST.RT function
is used to compute the upper tail p-value in cell D13. To compute the p-value for a
one-tailed hypothesis test in which the rejection region is in the upper tail, we entered the
formula =CHISQ.DIST.RT(D9,D10) in cell D13. Finally, to compute the p-value for a
two-tailed test, we entered the formula =2*MIN(D12,D13) in cell D14. The value worksheet
shows that p-value (Lower Tail) = .7909, p-value (Upper Tail) = .2091, and p-value
(Two Tail) = .4181. The rejection region for the St. Louis Metro Bus example is in the
upper tail; thus, the appropriate p-value is .2091. At a .05 level of significance, we cannot
reject $H_0$ because .2091 > .05. Hence, the sample variance of $s^2 = 4.9$ is insufficient
evidence to conclude that the arrival time variance is not meeting the company standard.

This worksheet can be used as a template for other hypothesis tests about a population
variance. Enter the data in column A, revise the ranges for the functions in cells D3:D5 as
appropriate for the data, and type the hypothesized value $\sigma_0^2$ in cell D7. (To avoid having
to revise the cell ranges for functions in cells D3:D5, the A:A method of specifying cell
ranges can be used.) The p-value appropriate for the test can then be selected from cells
D12:D14. The worksheet can also be used for exercises in which the sample size, sample
variance, and hypothesized value are given.
For instance, in the previous subsection we conducted a two-tailed hypothesis test about
the variance in scores on a new examination given by the bureau of motor vehicles. The
sample size was 30, the sample variance was $s^2 = 162$, and the hypothesized value for
the population variance was $\sigma_0^2 = 100$. Using the worksheet in Figure 11.5, we can type 30
in cell D3, 162 in cell D5, and 100 in cell D7. The correct two-tailed p-value will then be
given in cell D14 as .0374.
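
For readers working outside Excel, the three p-values in the worksheet can be approximated with the Python standard library. This is a sketch, not the worksheet's method: `chi2_cdf` is a hypothetical helper that stands in for CHISQ.DIST(x, df, TRUE) by summing the power series for the regularized lower incomplete gamma function, and `variance_test_p_values` is likewise an illustrative name.

```python
import math


def chi2_cdf(x, df):
    """Lower tail probability P(X <= x) for a chi-square distribution.

    Uses the series gamma(a, z) = z^a e^(-z) * sum_k z^k / (a (a+1) ... (a+k))
    with a = df / 2 and z = x / 2, divided by Gamma(a).
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def variance_test_p_values(n, s2, sigma0_sq):
    """Return the test statistic and the lower, upper, and two-tailed p-values."""
    stat = (n - 1) * s2 / sigma0_sq
    lower = chi2_cdf(stat, n - 1)
    upper = 1.0 - lower
    two_tail = 2.0 * min(lower, upper)
    return stat, lower, upper, two_tail


# Bus arrival times: upper tail test, so the upper tail p-value applies
bus = variance_test_p_values(n=24, s2=4.9, sigma0_sq=4)
# Driver's license examination: two-tailed test
exam = variance_test_p_values(n=30, s2=162, sigma0_sq=100)

print("bus  (stat, lower, upper, two-tail):", [round(v, 4) for v in bus])
print("exam (stat, lower, upper, two-tail):", [round(v, 4) for v in exam])
```

Small differences from the worksheet values in the fourth decimal place can occur because the worksheet works from the raw data, whereas this sketch starts from the rounded sample variance.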




Methods 
1. Find the following chi-square distribution values from Table 11.1 or Table 3 of 
Appendix B (available online). 
a. $\chi^2_{.05}$ with df = 5
b. $\chi^2_{.025}$ with df = 15
c. $\chi^2_{.975}$ with df = 20
d. $\chi^2_{.01}$ with df = 10
e. $\chi^2_{.95}$ with df = 18
2. A sample of 20 items provides a sample standard deviation of 5. 
a. Compute the 90% confidence interval estimate of the population variance. 
b. Compute the 95% confidence interval estimate of the population variance. 
c. Compute the 95% confidence interval estimate of the population standard deviation. 
3. A sample of 16 items provides a sample standard deviation of 9.5. Test the following 
hypotheses using a = .05. What is your conclusion? Use both the p-value approach 
and the critical value approach. 


$$H_0\colon \sigma^2 \le 50$$
$$H_a\colon \sigma^2 > 50$$


Applications 
4. Package Delivery by Drones. Amazon.com is testing the use of drones to deliver
packages for same-day delivery. In order to quote narrow time windows, the variability
in delivery times must be sufficiently small. Consider a sample of 24 drone deliveries
with sample variance of $s^2 = .81$.
a. Construct a 90% confidence interval estimate of the population variance for the
drone delivery time.
b. Construct a 90% confidence interval estimate of the population standard deviation.
5. College Basketball Coaches’ Salaries. In 2018, Mike Krzyzewski and John Calipari
topped the list of highest paid college basketball coaches (Sports Illustrated website).
The following sample shows the head basketball coach’s salary for a sample of
10 schools playing NCAA Division I basketball. Salary data are in millions of dollars.
(DATA file: CoachSalary)


University               Coach's Salary     University               Coach's Salary
North Carolina State     22                 Miami (FL)               15
Iona                     5                  Creighton                (ES)
Texas A&M                2.4                Texas Tech               TS
Oregon                   2                  South Dakota State       3
Iowa State               2.0                New Mexico State         B


a. Use the sample mean for the 10 schools to estimate the population mean annual 
salary for head basketball coaches at colleges and universities playing NCAA Divi- 
sion I basketball. 

b. Use the data to estimate the population standard deviation for the annual salary for 
head basketball coaches. 

c. What is the 95% confidence interval for the population variance? 

d. What is the 95% confidence interval for the population standard deviation? 

6. Volatility of General Electric Stock. To analyze the risk, or volatility, associated with 
investing in General Electric common stock, consider a sample of the eight quarterly 
percent total returns. The percent total return includes the stock price change plus the 
dividend payment for the quarter. 


20.0    -20.5    12.2    12.6    10.5    -5.8    -18.7    15.3


a. What is the value of the sample mean? What is its interpretation? 

b. Compute the sample variance and sample standard deviation as measures of volatil- 
ity for the quarterly return for General Electric. 

c. Construct a 95% confidence interval for the population variance. 

d. Construct a 95% confidence interval for the population standard deviation. 


7. Halloween Spending. In 2017, Americans spent a record-high $9.1 billion on
Halloween-related purchases (the balance website). Sample data showing the
amount, in dollars, that 16 adults spent on a Halloween costume are as follows.
(DATA file: Halloween)
12 69 22 64 
33 36 31 44 
52 16 13 98 
45 32 63 26 
a. What is the estimate of the population mean amount adults spend on a Halloween 
costume? 
b. What is the sample standard deviation? 
c. Provide a 95% confidence interval estimate of the population standard deviation for 
the amount adults spend on a Halloween costume. 
8. Variability in Daily Change in Stock Price. Consider a day when the Dow Jones 
Industrial Average went up 149.82 points. The following table shows the stock price 
changes for a sample of 12 companies on that day. 
(DATA file: StockPriceChange)

Company                  Price Change ($)     Company                  Price Change ($)
Aflac                    81                   Johnson & Johnson        1.46
Altice USA               .41                  Loews Corporation        22
Bank of America          705                  Nokia Corporation        21
Diageo plc               12                   Sempra Energy            Sl
Fluor Corporation        Desi)                Sunoco LP                52
Goodrich Petroleum       BS                   Tyson Foods, Inc.        12

a. Compute the sample variance for the daily price change.

b. Compute the sample standard deviation for the price change. 

c. Provide 95% confidence interval estimates of the population variance and the 
population standard deviation. 

9. Aerospace Part Manufacturing. The competitive advantage of some small American 
factories such as In Tolerance Contract Manufacturing lies in their ability to produce 
parts with very narrow requirements, or tolerances, that are typical in the aerospace 
industry. Consider a product with specifications that call for a maximum variance in 
the lengths of the parts of .0004. Suppose the sample variance for 30 parts turns out 
to be $s^2 = .0005$. Use $\alpha = .05$ to test whether the population variance specification is
being violated. 


10. Costco Customer Satisfaction. Consumer Reports uses a 100-point customer
satisfaction score to rate the nation’s major chain stores. Assume that from past
experience with the satisfaction rating score, a population standard deviation of $\sigma = 12$
is expected. In 2012, Costco with its 432 warehouses in 40 states was the only chain
store to earn an outstanding rating for overall quality. A sample of 15 Costco customer
satisfaction scores follows. (DATA file: Costco)
95 90 83 75 95
98 80 83 82 93
86 80 94 64 62

a. What is the sample mean customer satisfaction score for Costco? 

b. What is the sample variance? 

c. What is the sample standard deviation? 

d. Construct a hypothesis test to determine whether the population standard deviation 
of $\sigma = 12$ should be rejected for Costco. With a .05 level of significance, what is
your conclusion? 

11. Variability in GMAT Scores. In 2016, the Graduate Management Admission Council
reported that the variance in GMAT scores was 14,660. At a recent summit, a group
of economics professors met to discuss the GMAT performance of undergraduate students
majoring in economics. Some expected the variability in GMAT scores achieved
by undergraduate economics students to be greater than the variability in GMAT
scores of the general population of GMAT takers. However, others took the opposite
view. The file EconGMAT contains GMAT scores for 51 randomly selected undergraduate
students majoring in economics.

a. Compute the mean, variance, and standard deviation of the GMAT scores for the 
51 randomly selected undergraduate students majoring in economics. 

b. Develop hypotheses to test whether the sample data indicate that the variance in 
GMAT scores for undergraduate students majoring in economics differs from the 
general population of GMAT takers. 

c. Use $\alpha = .05$ to conduct the hypothesis test formulated in part (b). What is your
conclusion? 

12. Vehicle Ownership by Fortune Magazine Subscribers. A Fortune study found that 
the variance in the number of vehicles owned or leased by subscribers to Fortune 
magazine is .94. Assume a sample of 12 subscribers to another magazine provided the 
following data on the number of vehicles owned or leased: 2, 1, 2, 0, 3, 2, 2, 1, 2, 1, 0,
and 1. 

a. Compute the sample variance in the number of vehicles owned or leased by the 
12 subscribers. 

b. Test the hypothesis $H_0\colon \sigma^2 = .94$ to determine whether the variance in the
number of vehicles owned or leased by subscribers of the other magazine
differs from $\sigma^2 = .94$ for Fortune. At a .05 level of significance, what is your
conclusion?
11.2 Inferences About Two Population Variances


In some statistical applications we may want to compare the variances in product quality 
resulting from two different production processes, the variances in assembly times for two 
assembly methods, or the variances in temperatures for two heating devices. In making 
comparisons about the two population variances, we will be using data collected from 
two independent random samples, one from population 1 and another from population 2.
The two sample variances $s_1^2$ and $s_2^2$ will be the basis for making inferences about the two
population variances $\sigma_1^2$ and $\sigma_2^2$. Whenever the variances of two normal populations are
equal ($\sigma_1^2 = \sigma_2^2$), the sampling distribution of the ratio of the two sample variances $s_1^2/s_2^2$ is
as follows.


SAMPLING DISTRIBUTION OF $s_1^2/s_2^2$ WHEN $\sigma_1^2 = \sigma_2^2$


Whenever independent random samples of sizes $n_1$ and $n_2$ are selected from two normal
populations with equal variances, the sampling distribution of


$$\frac{s_1^2}{s_2^2} \qquad (11.9)$$

has an F distribution with $n_1 - 1$ degrees of freedom for the numerator and $n_2 - 1$
degrees of freedom for the denominator; $s_1^2$ is the sample variance for the random sample
of $n_1$ items from population 1 and $s_2^2$ is the sample variance for the random sample
of $n_2$ items from population 2. (The F distribution is based on sampling from two
normal populations.)


Figure 11.6 is a graph of the F distribution with 20 degrees of freedom for both the 
numerator and denominator. As indicated by this graph, the F distribution is not symmetric, 
and the F values can never be negative. The shape of any particular F distribution depends 
on its numerator and denominator degrees of freedom. 

We will use $F_\alpha$ to denote the value of F that provides an area or probability of $\alpha$ in
the upper tail of the distribution. For example, as noted in Figure 11.6, $F_{.05}$ denotes the
upper tail area of .05 for an F distribution with 20 degrees of freedom for the numerator
and 20 degrees of freedom for the denominator. The specific value of $F_{.05}$ can be found
by referring to the F distribution table, a portion of which is shown in Table 11.3. Using
20 degrees of freedom for the numerator, 20 degrees of freedom for the denominator, and
the row corresponding to an area of .05 in the upper tail, we find $F_{.05} = 2.12$. Note that
the table can be used to find F values for upper tail areas of .10, .05, .025, and .01. See
Table 4 of Appendix B (available online) for a more extensive table for the F distribution.


FIGURE 11.6 F Distribution with 20 Degrees of Freedom for the Numerator
and 20 Degrees of Freedom for the Denominator

[Figure: F density curve with the upper tail area of .05 marked.]


TABLE 11.3 Selected Values from the F Distribution Table*


Denominator    Area in
Degrees of     Upper          Numerator Degrees of Freedom
Freedom        Tail       10       15       20       25       30

10             .10        2.32     2.24     2.20     2.17     2.16
               .05        2.98     2.85     2.77     2.73     2.70
               .025       3.72     3.52     3.42     3.35     3.31
               .01        4.85     4.56     4.41     4.31     4.25

15             .10        2.06     1.97     1.92     1.89     1.87
               .05        2.54     2.40     2.33     2.28     2.25
               .025       3.06     2.86     2.76     2.69     2.64
               .01        3.80     3.52     3.37     3.28     3.21

20             .10        1.94     1.84     1.79     1.76     1.74
               .05        2.35     2.20     2.12     2.07     2.04
               .025       2.77     2.57     2.46     2.40     2.35
               .01        3.37     3.09     2.94     2.84     2.78

25             .10        1.87     1.77     1.72     1.68     1.66
               .05        2.24     2.09     2.01     1.96     1.92
               .025       2.61     2.41     2.30     2.23     2.18
               .01        3.13     2.85     2.70     2.60     2.54

30             .10        1.82     1.72     1.67     1.63     1.61
               .05        2.16     2.01     1.93     1.88     1.84
               .025       2.51     2.31     2.20     2.12     2.07
               .01        2.98     2.70     2.55     2.45     2.39


*Note: A more extensive table is provided as Table 4 of Appendix B (available online).

Let us show how the F distribution can be used to conduct a hypothesis test about the 
variances of two populations. We begin with a test of the equality of two population vari- 
ances. The hypotheses are stated as follows. 


$$H_0\colon \sigma_1^2 = \sigma_2^2$$
$$H_a\colon \sigma_1^2 \ne \sigma_2^2$$


We make the tentative assumption that the population variances are equal. If $H_0$ is rejected,
we will draw the conclusion that the population variances are not equal.
The procedure used to conduct the hypothesis test requires two independent random
samples, one from each population. The two sample variances are then computed. We refer
to the population providing the larger sample variance as population 1. Thus, a sample size
of $n_1$ and a sample variance of $s_1^2$ correspond to population 1, and a sample size of $n_2$ and a
sample variance of $s_2^2$ correspond to population 2. Based on the assumption that both
populations have a normal distribution, the ratio of sample variances provides the following F
test statistic.


TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT POPULATION VARIANCES WITH $\sigma_1^2 = \sigma_2^2$


$$F = \frac{s_1^2}{s_2^2} \qquad (11.10)$$


Denoting the population with the larger sample variance as population 1, the test statistic
has an F distribution with $n_1 - 1$ degrees of freedom for the numerator and $n_2 - 1$
degrees of freedom for the denominator.


Because the F test statistic is constructed with the larger sample variance $s_1^2$ in the numerator,
the value of the test statistic will be in the upper tail of the F distribution. Therefore, the
F distribution table as shown in Table 11.3 and in Table 4 of Appendix B (available online)
need only provide upper tail areas or probabilities. If we did not construct the test statistic
in this manner, lower tail areas or probabilities would be needed. In this case, additional
calculations or more extensive F distribution tables would be required. (We always denote the
population with the larger sample variance as population 1.) Let us now consider an
example of a hypothesis test about the equality of two population variances.

Dullus County Schools is renewing its school bus service contract for the coming year 
and must select one of two bus companies, the Milbank Company or the Gulf Park Com- 
pany. We will use the variance of the arrival or pickup/delivery times as a primary measure 
of the quality of the bus service. Low variance values indicate the more consistent and 
higher-quality service. If the variances of arrival times associated with the two services are 
equal, Dullus School administrators will select the company offering the better financial 
terms. However, if the sample data on bus arrival times for the two companies indicate a 
significant difference between the variances, the administrators may want to give special 
consideration to the company with the better or lower variance service. The appropriate 
hypotheses follow. 


$H_0\colon \sigma_1^2 = \sigma_2^2$
$H_a\colon \sigma_1^2 \ne \sigma_2^2$


If $H_0$ can be rejected, the conclusion of unequal service quality is appropriate. We will use
a level of significance of $\alpha = .10$ to conduct the hypothesis test.

A sample of 26 arrival times for the Milbank service provides a sample variance of
48 and a sample of 16 arrival times for the Gulf Park service provides a sample variance
of 20. Because the Milbank sample provided the larger sample variance, we will denote
Milbank as population 1. Using equation (11.10), we find the value of the test statistic:

DATA file: SchoolBus

$$F = \frac{s_1^2}{s_2^2} = \frac{48}{20} = 2.40$$


The corresponding F distribution has $n_1 - 1 = 26 - 1 = 25$ numerator degrees of freedom
and $n_2 - 1 = 16 - 1 = 15$ denominator degrees of freedom.
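
The construction of the test statistic can be sketched in a few lines of Python. This is only an illustration of equation (11.10) under the convention that the larger sample variance goes in the numerator; the function name `f_test_statistic` and its argument order are assumptions made here, not part of the text.

```python
def f_test_statistic(n_a, var_a, n_b, var_b):
    """Return (F, numerator df, denominator df) for two samples.

    The sample with the larger variance is treated as population 1,
    so F = s1^2 / s2^2 is never less than 1.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("sample variances must be positive")
    if var_a >= var_b:
        n1, s1_sq, n2, s2_sq = n_a, var_a, n_b, var_b
    else:
        n1, s1_sq, n2, s2_sq = n_b, var_b, n_a, var_a
    return s1_sq / s2_sq, n1 - 1, n2 - 1


# Milbank: 26 arrival times, variance 48; Gulf Park: 16 arrival times, variance 20
f_value, df_num, df_den = f_test_statistic(26, 48, 16, 20)
print(f_value, df_num, df_den)
```

Passing the two samples in the opposite order gives the same result, because the function decides which sample plays the role of population 1.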

As with other hypothesis testing procedures, we can use the p-value approach or the 
critical value approach to obtain the hypothesis testing conclusion. Table 11.3 shows the 
following areas in the upper tail and corresponding F values for an F distribution with 
25 numerator degrees of freedom and 15 denominator degrees of freedom.

Area in Upper Tail                    .10     .05     .025    .01
F Value ($df_1 = 25$, $df_2 = 15$)    1.89    2.28    2.69    3.28
                                                  F = 2.40


Because F = 2.40 is between 2.28 and 2.69, the area in the upper tail of the
distribution is between .05 and .025. For this two-tailed test, we double the upper tail area,
which results in a p-value between .10 and .05. Because we selected $\alpha = .10$ as the
level of significance, the p-value $< \alpha = .10$. Thus, the null hypothesis is rejected.
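
The table lookup just described can be sketched as follows. The list `F_TABLE_25_15` holds the four tabled values quoted above for 25 and 15 degrees of freedom; the function name and the convention of returning 0 or 1 when the table cannot bound one side are assumptions of this sketch.

```python
# (upper tail area, F value) pairs for df1 = 25, df2 = 15, as quoted from Table 11.3
F_TABLE_25_15 = [(0.10, 1.89), (0.05, 2.28), (0.025, 2.69), (0.01, 3.28)]


def bracket_upper_tail_area(f_value, table):
    """Return (low, high) bounds on the upper tail area for f_value.

    If f_value is at least a tabled F value, its tail area is at most that
    entry's area; if it is at most a tabled F value, its tail area is at
    least that entry's area.
    """
    at_most = [area for area, f_crit in table if f_crit <= f_value]
    at_least = [area for area, f_crit in table if f_crit >= f_value]
    high = min(at_most) if at_most else 1.0
    low = max(at_least) if at_least else 0.0
    return low, high


low, high = bracket_upper_tail_area(2.40, F_TABLE_25_15)
# Two-tailed test: double both bounds on the upper tail area
p_low, p_high = 2 * low, 2 * high
print(f"upper tail area between {low} and {high}")
print(f"two-tailed p-value between {p_low} and {p_high}")
```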

This finding leads to the conclusion that the two bus services differ in terms of pickup/
delivery time variances. The recommendation is that the Dullus County School
administrators give special consideration to the better or lower variance service offered by the
Gulf Park Company.

We can use Excel to show that the test statistic F = 2.40 provides a two-tailed
p-value = .0812. With $.0812 < \alpha = .10$, the null hypothesis of equal population variances
is rejected. (Using Excel, 2*F.DIST.RT(2.4,25,15) = .0812.)
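
For readers working outside Excel, the same upper tail area can be approximated with only the Python standard library. The sketch below is not Excel's algorithm; it assumes the standard identity that relates the upper tail of the F distribution to the regularized incomplete beta function, evaluated with a continued fraction (modified Lentz method). All function names are illustrative.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the continued fraction
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd-numbered term of the continued fraction
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            return h
    raise RuntimeError("continued fraction did not converge")


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for a, b > 0 and 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_upper_tail(f_value, df_num, df_den):
    """Area to the right of f_value under the F(df_num, df_den) distribution."""
    if f_value <= 0:
        return 1.0
    x = df_den / (df_den + df_num * f_value)
    return regularized_incomplete_beta(df_den / 2.0, df_num / 2.0, x)


# Two-tailed p-value for the school bus example (the text reports .0812)
p_two_tailed = 2 * f_upper_tail(2.4, 25, 15)
print(round(p_two_tailed, 4))
```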


To use the critical value approach to conduct the two-tailed hypothesis test at the $\alpha = .10$
level of significance, we would select critical values with an area of $\alpha/2 = .10/2 = .05$ in each
tail of the distribution. Because the value of the test statistic computed using equation (11.10)
will always be in the upper tail, we only need to determine the upper tail critical value. From
Table 11.3, we see that $F_{.05} = 2.28$. Thus, even though we use a two-tailed test, the rejection
rule is stated as follows.


Reject $H_0$ if $F \ge 2.28$


Because the test statistic F = 2.40 is greater than 2.28, we reject $H_0$ and conclude that the
two bus services differ in terms of pickup/delivery time variances. 
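
The critical value approach for the two-tailed test can be sketched in the same spirit. The table values are those quoted for 25 and 15 degrees of freedom; the helper names are assumptions of this sketch, and the lookup only works for significance levels whose half appears in the table.

```python
import math

# (upper tail area, F value) pairs for df1 = 25, df2 = 15, as quoted from Table 11.3
F_TABLE_25_15 = [(0.10, 1.89), (0.05, 2.28), (0.025, 2.69), (0.01, 3.28)]


def two_tailed_critical_value(alpha, table):
    """Look up the upper tail critical value with area alpha / 2."""
    target = alpha / 2
    for area, f_crit in table:
        if math.isclose(area, target):
            return f_crit
    raise KeyError(f"no tabled F value for upper tail area {target}")


def reject_equal_variances(f_value, alpha, table):
    """Apply the rule: reject H0 if F >= F_(alpha/2)."""
    return f_value >= two_tailed_critical_value(alpha, table)


print(two_tailed_critical_value(0.10, F_TABLE_25_15))
print(reject_equal_variances(2.40, 0.10, F_TABLE_25_15))
```

With the same statistic and a stricter significance level of .05, the lookup uses the .025 column (2.69) and the null hypothesis would not be rejected.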

One-tailed tests involving two population variances are also possible. In this case, we 
use the F distribution to determine whether one population variance is significantly greater 
than the other. A one-tailed hypothesis test about two population variances will always be 
formulated as an upper tail test: 


$H_0\colon \sigma_1^2 \le \sigma_2^2$
$H_a\colon \sigma_1^2 > \sigma_2^2$

This form of the hypothesis test always places the p-value and the critical value in the
upper tail of the F distribution. As a result, only upper tail F values will be needed,
simplifying both the computations and the table for the F distribution.

Note: A one-tailed hypothesis test about two population variances can always be
formulated as an upper tail test. This approach eliminates the need for lower tail F values.

Let us demonstrate the use of the F distribution to conduct a one-tailed test about
the variances of two populations by considering a public opinion survey. Samples of
31 men and 41 women will be used to study attitudes about current political issues.
The researcher conducting the study wants to test to see whether the sample data
indicate that women show a greater variation in attitude on political issues than men.
In the form of the one-tailed hypothesis test given previously, women will be denoted
as population 1 and men will be denoted as population 2. The hypothesis test will be
stated as follows.


$H_0\colon \sigma_{\text{women}}^2 \le \sigma_{\text{men}}^2$
$H_a\colon \sigma_{\text{women}}^2 > \sigma_{\text{men}}^2$


A rejection of $H_0$ gives the researcher the statistical support necessary to conclude that
women show a greater variation in attitude on political issues. 

With the sample variance for women in the numerator and the sample variance for men
in the denominator, the F distribution will have $n_1 - 1 = 41 - 1 = 40$ numerator degrees
of freedom and $n_2 - 1 = 31 - 1 = 30$ denominator degrees of freedom. We will use a
level of significance $\alpha = .05$ to conduct the hypothesis test. The survey results provide a
sample variance of $s_1^2 = 120$ for women and a sample variance of $s_2^2 = 80$ for men. The test
statistic is as follows.

$$F = \frac{s_1^2}{s_2^2} = \frac{120}{80} = 1.50$$

Referring to Table 4 in Appendix B (available online), we find that an F distribution with
40 numerator degrees of freedom and 30 denominator degrees of freedom has $F_{.10} = 1.57$.
Because the test statistic F = 1.50 is less than 1.57, the area in the upper tail must be
greater than .10. Thus, we can conclude that the p-value is greater than .10. Excel can be
used to show that F = 1.50 provides a p-value = .1256. (Using Excel, p-value =
F.DIST.RT(1.5,40,30) = .1256.) Because the p-value $> \alpha = .05$, $H_0$ cannot be rejected.
Hence, the sample results do not support the conclusion that women show greater variation
in attitude on political issues than men. Table 11.4 provides a summary of hypothesis tests
about two population variances.

TABLE 11.4 Summary of Hypothesis Tests About Two Population Variances

                          Upper Tail Test                         Two-Tailed Test

Hypotheses                $H_0\colon \sigma_1^2 \le \sigma_2^2$   $H_0\colon \sigma_1^2 = \sigma_2^2$
                          $H_a\colon \sigma_1^2 > \sigma_2^2$     $H_a\colon \sigma_1^2 \ne \sigma_2^2$
                                                                  Note: Population 1 has the larger
                                                                  sample variance

Test Statistic            $F = s_1^2 / s_2^2$                     $F = s_1^2 / s_2^2$

Rejection Rule:           Reject $H_0$ if                         Reject $H_0$ if
p-Value Approach          p-value $\le \alpha$                    p-value $\le \alpha$

Rejection Rule:           Reject $H_0$ if                         Reject $H_0$ if
Critical Value Approach   $F \ge F_{\alpha}$                      $F \ge F_{\alpha/2}$
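
The reasoning used in the opinion survey example, where a single tabled value bounds the p-value, can be sketched as below. The function name and the three possible return strings are assumptions of this sketch; the numbers (variances 120 and 80, tabled value 1.57 for an upper tail area of .10, and a .05 level of significance) come from the example.

```python
def conclude_from_table_bound(f_value, tabled_f, tabled_area, alpha):
    """Decide an upper tail F test from one tabled (area, F value) pair.

    If f_value is below the tabled F value, the p-value exceeds tabled_area;
    if it is at or above the tabled F value, the p-value is at most tabled_area.
    """
    if f_value < tabled_f and tabled_area >= alpha:
        return "do not reject H0"   # p-value > tabled_area >= alpha
    if f_value >= tabled_f and tabled_area <= alpha:
        return "reject H0"          # p-value <= tabled_area <= alpha
    return "undetermined from this table entry"


# Women (population 1): variance 120; men (population 2): variance 80
f_value = 120 / 80
decision = conclude_from_table_bound(f_value, tabled_f=1.57, tabled_area=0.10, alpha=0.05)
print(f_value, decision)
```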


Using Excel to Conduct a Hypothesis Test 


Excel’s F-Test Two-Sample for Variances tool can be used to conduct a hypothesis test 
comparing the variances of two populations. We illustrate by using Excel to conduct the 
two-tailed hypothesis test for the Dullus County School Bus study. Refer to the Excel 
worksheets in Figure 11.7 and in Figure 11.8 as we describe the tasks involved. 


Enter/Access Data: Open the file SchoolBus. Column A contains the sample of 26 arrival
times for the Milbank Company and column B contains the sample of 16 arrival times for
the Gulf Park Company.

Apply Tools: The following steps describe how to use Excel's F-Test Two-Sample for
Variances tool.


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. When the Data Analysis dialog box appears:
            Choose F-Test Two-Sample for Variances from the list of Analysis Tools
Step 4. When the F-Test Two-Sample for Variances dialog box appears (Figure 11.7):
            Enter A1:A27 in the Variable 1 Range: box
            Enter B1:B17 in the Variable 2 Range: box
            Select the check box for Labels
            Enter .05 in the Alpha: box
            (Note: This Excel procedure uses alpha as the area in the upper tail.)
            Select Output Range: and enter D1 in the box
            Click OK


FIGURE 11.7 Dialog Box for F-Test Two-Sample for Variances

[Figure: the F-Test Two-Sample for Variances dialog box (Variable 1 Range, Variable 2 Range,
and Output options: Output Range, New Worksheet Ply, New Workbook) shown over the SchoolBus
worksheet, whose columns A and B are labeled Milbank and Gulf Park. Note: Rows 12-15 and
19-25 are hidden.]

The value of .0405 in cell E9, labeled "P(F<=f) one-tail", is the one-tail area associated
with the test statistic F = 2.401. Thus, the two-tailed p-value is 2(.0405) = .081; we reject
the null hypothesis at the .10 level of significance. If the hypothesis test had been a
one-tail test (with $\alpha = .05$), the one-tail area in cell E9 would provide the p-value directly. We
would not need to double it.
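
As a rough parallel to the summary rows of the Excel output, the following Python sketch computes the mean, variance, number of observations, degrees of freedom, and F ratio from two raw samples. It does not reproduce Excel's tool; the function name is an assumption, and the two short samples are hypothetical values chosen for illustration, not the SchoolBus data.

```python
from statistics import mean, variance


def f_test_summary(sample_1, sample_2):
    """Summarize two samples and form F = (variance of 1) / (variance of 2).

    The caller is expected to pass the sample with the larger variance first,
    matching the convention that population 1 has the larger sample variance.
    """
    var_1 = variance(sample_1)   # sample variance (n - 1 in the denominator)
    var_2 = variance(sample_2)
    return {
        "mean": (mean(sample_1), mean(sample_2)),
        "variance": (var_1, var_2),
        "observations": (len(sample_1), len(sample_2)),
        "df": (len(sample_1) - 1, len(sample_2) - 1),
        "F": var_1 / var_2,
    }


# Hypothetical data for illustration only (not the SchoolBus file)
sample_a = [2.0, 4.0, 4.0, 6.0, 9.0]
sample_b = [3.0, 4.0, 5.0, 5.0, 3.0]
summary = f_test_summary(sample_a, sample_b)
for label, value in summary.items():
    print(label, value)
```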


NOTE + COMMENT


Research confirms the fact that the F distribution is sensitive to the assumption of normal
populations. The F distribution should not be used unless it is reasonable to assume that both
populations are at least approximately normally distributed.


EXERCISES 


Methods 

13. Find the following F distribution values from Table 4 of Appendix B (available online). 
a. $F_{.05}$ with degrees of freedom 5 and 10
b. $F_{.025}$ with degrees of freedom 20 and 15
c. $F_{.01}$ with degrees of freedom 8 and 12
d. $F_{.10}$ with degrees of freedom 10 and 20

14. A sample of 16 items from population 1 has a sample variance $s_1^2 = 5.8$ and a
sample of 21 items from population 2 has a sample variance $s_2^2 = 2.4$. Test the following
hypotheses at the .05 level of significance.

$H_0\colon \sigma_1^2 \le \sigma_2^2$
$H_a\colon \sigma_1^2 > \sigma_2^2$

a. What is your conclusion using the p-value approach?
b. Repeat the test using the critical value approach.

15. Consider the following hypothesis test.

$H_0\colon \sigma_1^2 = \sigma_2^2$
$H_a\colon \sigma_1^2 \ne \sigma_2^2$

a. What is your conclusion if $n_1 = 21$, $s_1^2 = 8.2$, $n_2 = 26$, and $s_2^2 = 4.0$? Use $\alpha = .05$
and the p-value approach.
b. Repeat the test using the critical value approach.


Applications 

16. Comparing Risk of Mutual Funds. Investors commonly use the standard deviation 
of the monthly percentage return for a mutual fund as a measure of the risk for the 
fund; in such cases, a fund that has a larger standard deviation is considered more risky 
than a fund with a lower standard deviation. The standard deviation for the American 
Century Equity Growth fund and the standard deviation for the Fidelity Growth 
Discovery fund were recently reported to be 15.0% and 18.9%, respectively. Assume 
that each of these standard deviations is based on a sample of 60 months of returns. Do 
the sample results support the conclusion that the Fidelity fund has a larger population 
variance than the American Century fund? Which fund is more risky? 

17. Repair Costs as Automobiles Age. In its 2016 Auto Reliability Survey, Consumer 
Reports asked subscribers to report their maintenance and repair costs. Most individu- 
als are aware of the fact that the average annual repair cost for an automobile depends 
on the age of the automobile. A researcher is interested in finding out whether the vari- 
ance of the annual repair costs also increases with the age of the automobile. A sample 
of 26 automobiles 4 years old showed a sample standard deviation for annual repair 
costs of $170 and a sample of 25 automobiles 2 years old showed a sample standard 
deviation for annual repair costs of $100.
a. State the null and alternative versions of the research hypothesis that the variance in
annual repair costs is larger for the older automobiles.
b. At a .01 level of significance, what is your conclusion? What is the p-value? Discuss
the reasonableness of your findings.

18. Variance in Fund Amounts: Merrill Lynch Versus Morgan Stanley. Barron's has
collected data on the top 1000 financial advisors. Merrill Lynch and Morgan Stanley
have many of their advisors on this list. A sample of 16 of the Merrill Lynch advisers
and 10 of the Morgan Stanley advisers showed that the advisers managed many very
large accounts with a large variance in the total amount of funds managed. The
standard deviation of the amount managed by the Merrill Lynch advisers was $s_1$ = $587
million. The standard deviation of the amount managed by the Morgan Stanley
advisers was $s_2$ = $489 million. Conduct a hypothesis test at $\alpha = .10$ to determine whether
there is a significant difference in the population variances for the amounts managed
by the two companies. What is your conclusion about the variability in the amount of
funds managed by advisers from the two firms?

19. Bag-Filling Machines at Jelly Belly. The variance in a production process is an
important measure of the quality of the process. A large variance often signals an
opportunity for improvement in the process by finding ways to reduce the process variance.
Jelly Belly Candy Company is testing two machines that use different technologies
to fill three-pound bags of jelly beans. The file Bags contains a sample of data on the
weights of bags (in pounds) filled by each machine. Conduct a statistical test to
determine whether there is a significant difference between the variances in the bag weights
for two machines. Use a .05 level of significance. What is your conclusion? Which
machine, if either, provides the greater opportunity for quality improvements?

20. Salaries at Public Accounting Firms. On the basis of data provided by a Romac
salary survey, the variance in annual salaries for senior partners in public accounting
firms is approximately 2.1 and the variance in annual salaries for managers in public
accounting firms is approximately 11.1. The salary data were provided in thousands of
dollars. Assuming that the salary data were based on samples of 25 senior partners and
26 managers, test the hypothesis that the population variances in the salaries are equal.
At a .05 level of significance, what is your conclusion?

21. Smartphone Battery Life. Battery life is an important issue for many smartphone
owners. Public health studies have examined "low-battery anxiety" and acute anxiety
called "nomophobia" that results when a smartphone user's phone battery charge runs
low and then dies (Wall Street Journal, May 5, 2018). Battery life between charges
for the Samsung Galaxy S9 averages 31 hours when the primary use is talk time and
10 hours when the primary use is Internet applications. Because the mean hours for
talk time usage is greater than the mean hours for Internet usage, the question was
raised as to whether the variance in hours of usage is also greater when the primary
use is talk time. Sample data showing battery life between charges for the two
applications follows.


Primary Use: Talking

35.8    22.2    24.0    32.6    18.5    42.5
28.0    23.8    30.0    22.8    20.3    25.5

Primary Use: Internet

14.0    12.5    16.4    WY      Ss)     Sal
5.4     11.0    15.2    4.0     4.7


a. Formulate hypotheses about the two population variances that can be used to 
determine whether the population variance in battery life is greater for the talk time 
application. 

b. What are the standard deviations of battery life for the two samples?
c. Conduct the hypothesis test and compute the p-value. Using a .05 level of
significance, what is your conclusion?

22. Stopping Distances of Automobiles. A research hypothesis is that the variance of 
stopping distances of automobiles on wet pavement is substantially greater than the 
variance of stopping distances of automobiles on dry pavement. In the research study, 
16 automobiles traveling at the same speeds are tested for stopping distances on wet 
pavement and then tested for stopping distances on dry pavement. On wet pavement, 
the standard deviation of stopping distances is 32 feet. On dry pavement, the standard 
deviation is 16 feet. 

a. At a .05 level of significance, do the sample data justify the conclusion that the
variance in stopping distances on wet pavement is greater than the variance in stop- 
ping distances on dry pavement? What is the p-value? 

b. What are the implications of your statistical conclusions in terms of driving safety 
recommendations? 


SUMMARY 


In this chapter we presented statistical procedures that can be used to make inferences 
about population variances. In the process we introduced two new probability distribu- 
tions: the chi-square distribution and the F distribution. The chi-square distribution can 
be used as the basis for interval estimation and hypothesis tests about the variance of a 
normal population. 

We illustrated the use of the F distribution in hypothesis tests about the variances of
two normal populations. In particular, we showed that with independent random samples
of sizes $n_1$ and $n_2$ selected from two normal populations with equal variances $\sigma_1^2 = \sigma_2^2$, the
sampling distribution of the ratio of the two sample variances $s_1^2/s_2^2$ has an F distribution
with $n_1 - 1$ degrees of freedom for the numerator and $n_2 - 1$ degrees of freedom for the
denominator.


KEY FORMULAS

Interval Estimate of a Population Variance

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}} \qquad (11.7)$$

Test Statistic for Hypothesis Tests About a Population Variance

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$

Test Statistic for Hypothesis Tests About Population Variances with $\sigma_1^2 = \sigma_2^2$

$$F = \frac{s_1^2}{s_2^2} \qquad (11.10)$$
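
The interval estimate in equation (11.7) can be sketched as a short Python function. The inputs below are illustrative assumptions (a sample of 20 with sample variance .0025 and 95% confidence); the two chi-square values, 32.852 and 8.907, are the standard table entries for 19 degrees of freedom with upper tail areas .025 and .975, and are typed in by hand rather than computed.

```python
import math


def variance_interval(n, sample_variance, chi2_upper, chi2_lower):
    """Interval estimate of a population variance, following equation (11.7).

    chi2_upper is the chi-square value with upper tail area alpha / 2 and
    chi2_lower is the value with upper tail area 1 - alpha / 2, both with
    n - 1 degrees of freedom.
    """
    df = n - 1
    return df * sample_variance / chi2_upper, df * sample_variance / chi2_lower


# Illustrative inputs (assumptions of this sketch): n = 20, s^2 = .0025, 95% confidence
var_low, var_high = variance_interval(20, 0.0025, chi2_upper=32.852, chi2_lower=8.907)

# Taking square roots gives the interval for the population standard deviation
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)
print(round(var_low, 4), round(var_high, 4))
print(round(sd_low, 4), round(sd_high, 4))
```

Note that the larger chi-square value produces the lower limit, because it appears in the denominator.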


SUPPLEMENTARY EXERCISES


23. Daily Hotel Room Occupancy. Because of staffing decisions, managers of the 
Gibson-Marimont Hotel are interested in the variability in the number of rooms 
occupied per day during a particular season of the year. A sample of 20 days of op- 
eration shows a sample mean of 290 rooms occupied per day and a sample standard 
deviation of 30 rooms. 

a. What is the point estimate of the population variance? 
b. Provide a 90% confidence interval estimate of the population variance. 
c. Provide a 90% confidence interval estimate of the population standard deviation.

24. Pricing of Initial Public Offerings. Initial public offerings (IPOs) of stocks are on
average underpriced. The standard deviation measures the dispersion, or variation, 
in the underpricing-overpricing indicator. A sample of 13 Canadian IPOs that were 
subsequently traded on the Toronto Stock Exchange had a standard deviation of 14.95. 
Develop a 95% confidence interval estimate of the population standard deviation for 
the underpricing-overpricing indicator. 

25. Business Travel Costs. According to the 2017 Corporate Travel Index compiled by 
Business Travel News, the average daily cost for business travel in the United States 
rose to $321 per day (Executive Travel website). The file TravelCosts contains sample 
data for an analogous study on the estimated daily living costs for an executive trav- 
eling to various international cities. The estimates include a single room at a four-star 
hotel, beverages, breakfast, taxi fares, and incidental costs. 


City            Daily Living Cost ($)    City                Daily Living Cost ($)
Bangkok         242.87                   Mexico City         212.00
Bogota          260.93                   Milan               284.08
Cairo           194.19                   Mumbai              139.16
Dublin          260.76                   Paris               436.72
Frankfurt       355.36                   Rio de Janeiro      240.87
Hong Kong       346.32                   Seoul               310.41
Johannesburg    165.37                   Tel Aviv            223.73
Lima            250.08                   Toronto             181.25
London          326.76                   Warsaw              238.20
Madrid          283.56                   Washington, D.C.    250.61


a. Compute the sample mean. 
b. Compute the sample standard deviation. 
c. Compute a 95% confidence interval for the population standard deviation. 

26. Manufacturing of Ball Bearings. Ball bearing manufacturing is a very precise 
business and minimal part variability is critical. Large variances in the size of the ball 
bearings cause bearing failure and rapid wearout. Production standards call for a
maximum variance of .0001 inches$^2$. Gerry Liddy has gathered a sample of 15 bearings that
shows a sample standard deviation of .014 inches. 

a. Use $\alpha = .10$ to determine whether the sample indicates that the maximum
acceptable variance is being exceeded.

b. Compute the 90% confidence interval estimate of the variance of the ball bearings 
in the population. 

27. Count Chocula Cereal. Filling boxes with consistent amounts of its cereals is critical
to General Mills' success. The filling variance for boxes of Count Chocula cereal
is designed to be .02 ounces$^2$ or less. A sample of 41 boxes of Count Chocula shows a
sample standard deviation of .16 ounces. Use $\alpha = .05$ to determine whether the variance
in the cereal box fillings is exceeding the design specification.

28. Grubhub Food Delivery. Grubhub is a service that delivers food that its customers 
order online from participating restaurants. Grubhub claims consistent delivery times 
for its deliveries. A sample of 22 meal deliveries shows a sample variance of 1.5. Test 
to determine whether $H_0\colon \sigma^2 = 1$ can be rejected. Use $\alpha = .10$.

29. Daily Patient Volume at Dental Clinic. A sample of 9 days over the past six months 
showed that Philip Sherman, DDS, treated the following numbers of patients at 
this dental clinic: 22, 25, 20, 18, 15, 22, 24, 19, and 26. If the number of patients 
seen per day is normally distributed, would an analysis of these sample data reject the 
hypothesis that the variance in the number of patients seen per day is equal to 10? Use 
a .10 level of significance. What is your conclusion? 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 




30. Passenger Volume on Allegiant Airlines. A sample standard deviation for the number 
of passengers taking a particular Allegiant Airlines flight is 8. A 95% confidence inter- 
val estimate of the population standard deviation is 5.86 passengers to 12.62 passengers. 
a. Was a sample size of 10 or 15 used in the statistical analysis? 

b. Suppose the sample standard deviation of s = 8 was based on a sample of 25 flights. 
What change would you expect in the confidence interval for the population standard 
deviation? Compute a 95% confidence interval estimate of $\sigma$ with a sample size
of 25. 

31. Golf Scores. Is there any difference in the variability in golf scores for players on the 
LPGA Tour (the women’s professional golf tour) and players on the PGA Tour (the 
men’s professional golf tour)? A sample of 20 tournament scores from LPGA events 
showed a standard deviation of 2.4623 strokes, and a sample of 30 tournament scores 
from PGA events showed a standard deviation of 2.2118. Conduct a hypothesis test for 
equal population variances to determine whether there is any statistically significant 
difference in the variability of golf scores for male and female professional golfers. 
Use $\alpha = .10$. What is your conclusion?

32. Grade Point Average Comparison. The grade point averages of 352 students who 
completed a college course in financial accounting have a standard deviation of .940. 
The grade point averages of 73 students who dropped out of the same course have a 
standard deviation of .797. Do the data indicate a difference between the variances of 
grade point averages for students who completed a financial accounting course and 
students who dropped out? Use a .05 level of significance. (Note: $F_{.025}$ with 351 and
72 degrees of freedom is 1.466.)

33. Weekly Cost Reporting. Stable cost reporting in a manufacturing setting is typically 
a sign that operations are running smoothly. The accounting department at Rockwell 
Collins, an avionics manufacturer, analyzes the variance of the weekly costs reported 
by two of its production departments. A sample of 16 cost reports for each of the two 
departments shows cost variances of 2.3 and 5.4, respectively. Is this sample sufficient 
to conclude that the two production departments differ in terms of weekly cost
variance? Use $\alpha = .10$.

34. Lean Process Improvement at the New York City Food Bank. In an effort to make 
better use of its resources, the New York City Food Bank engaged in lean process im- 
provement. This employee-driven kaizen effort resulted in a new method for packing 
meals for distribution to needy families. One goal of the process improvement effort 
was to reduce the variability in the meal-packing time. The following table summa- 
rizes information from a sample of data using the current method and the new method. 
Did the kaizen event successfully reduce the population variation? Use $\alpha = .10$ and
formulate the appropriate hypothesis test. 


                   Current Method    New Method
Sample Size        $n_1 = 31$        $n_2 = 25$
Sample Variance    $s_1^2 = 25$      $s_2^2 = 12$


CASE PROBLEM 1: AIR FORCE TRAINING PROGRAM


An Air Force introductory course in electronics uses a personalized system of instruction 
whereby each student views a videotaped lecture and then is given a programmed instruc- 
tion text. The students work independently with the text until they have completed the 
training and passed a test. Of concern is the varying pace at which the students complete 
this portion of their training program. Some students are able to cover the programmed 
instruction text relatively quickly, whereas other students work much longer with the text 
and require additional time to complete the course. The fast students wait until the slow
students complete the introductory course before the entire group proceeds together with
other aspects of their training. 

A proposed alternative system involves use of computer-assisted instruction. In this 
method, all students view the same videotaped lecture and then each is assigned to a com- 
puter terminal for further instruction. The computer guides the student, working indepen- 
dently, through the self-training portion of the course. 

To compare the proposed and current methods of instruction, an entering class of 
122 students was assigned randomly to one of the two methods. One group of 61 students 
used the current programmed-text method and the other group of 61 students used the 
proposed computer-assisted method. The time in hours was recorded for each student in the 
study. The following data are provided in the file named Training. 


Course Completion Times (hours) for Current Training Method 


76 76 Yi 74 76 74 74 7 72 78 73 
78 iS 80 T V2 69 T U2 70 70 81 
76 78 72 82 72 73 71 70 77 78 73 
79 82 65 Te, 19 73 76 81 69 75 15 


77 79 76 78 76 76 73 Vi 84 74 74
69 79 66 70 74 72


Course Completion Times (hours) for Proposed Computer-Assisted Method


74 75 77 78 74 80 73 73 78 76 76
74 77 69 76 75 72 75 WE 76 72 77
73 77 69 77 75 76 74 77 75 78 72
Ui 78 78 76 75 76 76 WS) 76 80 WE
76 75 73 1 TA 77 79 75 US 72 82
76 76 74 We 78 Ta


Managerial Report

1. Use appropriate descriptive statistics to summarize the training time data for each
method. What similarities or differences do you observe from the sample data?

2. Conduct a hypothesis test on the difference between the population means for the two
methods. Discuss your findings. (We discuss interval estimation and hypothesis testing
on the difference between population means in Chapter 10.)

3. Compute the standard deviation and variance for each training method. Conduct a
hypothesis test about the equality of population variances for the two training methods.
Discuss your findings.

4. What conclusion can you reach about any differences between the two methods? What 
is your recommendation? Explain. 

5. Can you suggest other data or testing that might be desirable before making a final 
decision on the training program to be used in the future? 


CASE PROBLEM 2: METICULOUS DRILL & REAMER


Meticulous Drill & Reamer (MD&R) specializes in drilling and boring precise holes in hard 
metals (e.g., steel alloys, tungsten carbide, and titanium). The company recently contracted 
to drill holes with 3-centimeter diameters in large carbon-steel alloy disks, and it will have
to purchase a special drill to complete this job. MD&R has eliminated all but two of the
drills it has been considering: Davis Drills’ T2005 and Worth Industrial Tools’ AZ100. These 
producers have each agreed to allow MD&R to use a T2005 and an AZ100 for one week to 
determine which drill it will purchase. During the one-week trial, MD&R uses each of these 
drills to drill 31 holes with a target diameter of 3 centimeters in one large carbon-steel alloy
disk, then measures the diameter of each hole and records the results. MD&R's results are
provided in the table that follows and are available in the file MeticulousDrills. 


Hole Diameter


T2005    AZ100    T2005    AZ100    T2005    AZ100
3.06     2.91     3.05     2.97     3.04     3.06
3.04     3.31     3.01     3.05     3.01     3.25
3.13     2.82     28       2.95     2.95     2.82
3.01     3.01     3.12     2.92     3.14     3.22
2.95     2.94     3.04     2.71     ae)      2.93
3.02     a7       3.10     2.77     3.01     3.24
3.02     3.25     3.02     DIR      2.93     27y
REP      3.39     2.92     3.18     3.00     2.94
3.00     3.22     3.01     2.95     3.04     3.31
3.04     2.97     3.15     2.86
3.03     2.93     2.69     eG


MD&R wants to consider both the accuracy (closeness of the diameter to 3 centimeters) 
and the precision (the variance of the diameter) of the holes drilled by the T2005 and the 
AZ100 when deciding which model to purchase. 


Managerial Report 
In making this assessment for MD&R, consider the following four questions: 


1. Are the holes drilled by the T2005 or the AZ100 more accurate? That is, which model 
of drill produces holes with a mean diameter closer to 3 centimeters? 

2. Are the holes drilled by the T2005 or the AZ100 more precise? That is, which model of 
drill produces holes with a smaller variance? 

3. Conduct a test of the hypothesis that the T2005 and the AZ100 are equally precise (that 
is, have equal variances) at $\alpha = .05$. Discuss your findings.

4. Which drill do you recommend to MD&R? Why?


STATISTICS IN PRACTICE


United Way* 
ROCHESTER, NEW YORK 


United Way of Greater Rochester is a nonprofit organization dedicated to improving the
quality of life for all people in the seven counties it serves by meeting the community's
most important human care needs.

The annual United Way/Red Cross fund-raising campaign funds hundreds of programs
offered by more than 200 service providers. These providers meet a wide variety of human
needs (physical, mental, and social) and serve people of all ages, backgrounds, and
economic means.

The United Way of Greater Rochester decided to conduct a survey to learn more about
community perceptions of charities. Focus-group interviews were held with professional,
service, and general worker groups to obtain preliminary information on perceptions. The
information obtained was then used to help develop the questionnaire for the survey. The
questionnaire was pretested, modified, and distributed to 440 individuals.

A variety of descriptive statistics, including frequency distributions and
crosstabulations, were provided from the data collected. An important part of the analysis
involved the use of chi-square tests of independence. One use of such statistical tests was
to determine whether perceptions of administrative expenses were independent of the
occupation of the respondent.

The hypotheses for the test of independence were as follows:

$H_0$: Perception of United Way administrative expenses is independent of the occupation
of the respondent.

$H_a$: Perception of United Way administrative expenses is not independent of the
occupation of the respondent.

Two questions in the survey provided categorical data for the statistical test. One
question obtained data on perceptions of the percentage of funds going to administrative
expenses (up to 10%, 11-20%, and 21% or more). The other question asked for the occupation
of the respondent.

The test of independence led to rejection of the null hypothesis and to the conclusion
that perception of United Way administrative expenses is not independent of the occupation
of the respondent. Actual administrative expenses were less than 9%, but 35% of the
respondents perceived that administrative expenses were 21% or more. Hence, many
respondents had inaccurate perceptions of administrative expenses. In this group,
production-line, clerical, sales, and professional-technical employees had the more
inaccurate perceptions.

The community perceptions study helped United Way of Rochester develop adjustments to
its programs and fund-raising activities. In this chapter, you will learn how tests, such
as described here, are conducted.

[Photo caption: United Way programs meet the needs of children as well as adults.]


*The authors are indebted to Dr. Philip R. Tyler, marketing consultant 
to the United Way, for providing this Statistics in Practice. 


In this chapter, we introduce three hypothesis testing procedures that extend our ability to 
make statistical inferences about populations. Specifically, we consider a test statistic based 
on the chi-square ($\chi^2$) distribution. These tests are all based on comparing observed sample
results with those that are expected when the null hypothesis is true. The hypothesis testing
conclusion is based upon using a chi-square test statistic to determine how “close” the
sample results are to the expected results.

In Chapter 11 we showed how the chi-square distribution can be used to compute an interval
estimate and conduct hypothesis tests about a population variance.

The sum of the probabilities for a multinomial probability distribution equals 1.

All three hypothesis tests are designed for use with categorical data. In cases in which
data are not naturally categorical, we define categories and consider the observation count 
in each category. The goodness of fit test in Section 12.1 is useful when we want to test 
whether a frequency distribution developed from categorical data is a good fit to a hypothe- 
sized probability distribution for the population. In Section 12.2 we show how the chi- 
square test of independence is used to determine whether two categorical variables sampled 
from one population are independent. Section 12.3 describes a chi-square test for multiple 
populations by showing how sample data from three or more populations can be used to 
determine whether the population proportions are equal. 


12.1 Goodness of Fit Test 


The chi-square goodness of fit test can be used to determine whether a random variable 
has a specific probability distribution. In this section we show how to conduct a goodness 
of fit test for a random variable with a multinomial probability distribution. 


Multinomial Probability Distribution 


A multinomial probability distribution is an extension of the binomial probability distri- 
bution to the case where there are three or more categories of outcomes per trial. The cate- 
gory probabilities are the key parameters of the multinomial distribution. For an application 
of a goodness of fit test to a multinomial probability distribution, consider the market share 
study being conducted by Scott Marketing Research. Over the past year, market shares for 
a certain product have stabilized at 30% for company A, 50% for company B, and 20% for 
company C. Since each customer is classified as buying from one of these companies, we 
have a multinomial probability distribution with three possible categories of outcomes. The 
probability for each of the three categories is as follows. 


$p_A$ = probability a customer purchases the company A product
$p_B$ = probability a customer purchases the company B product
$p_C$ = probability a customer purchases the company C product

Using the historical market shares, we have a multinomial probability distribution with
$p_A = .30$, $p_B = .50$, and $p_C = .20$.

Company C plans to introduce a “new and improved” product to replace its current 
entry in the market. Company C has retained Scott Marketing Research to determine 
whether the new product will alter or change the market shares for the three companies. 
Specifically, Scott Marketing Research will introduce a sample of customers to the new 
company C product and then ask the customers to indicate a preference for the company A 
product, the company B product, or the new company C product. Based on the sample 
data, the following hypothesis test can be used to determine whether the new company C 
product is likely to change the historical market shares for the three companies. 


$H_0$: $p_A = .30$, $p_B = .50$, and $p_C = .20$
$H_a$: The probabilities are not $p_A = .30$, $p_B = .50$, and $p_C = .20$


The null hypothesis is based on the historical multinomial probability distribution for the
market shares. If sample results lead to the rejection of $H_0$, Scott Marketing Research will
have evidence to conclude that the introduction of the new company C product will change
the market shares (the category probabilities for the multinomial distribution).

Let us assume that the market research firm has used a consumer panel of 200 customers. 
Each customer was asked to specify a purchase preference among the three alternatives: com- 
pany A’s product, company B’s product, and company C’s new product. The 200 responses 
are summarized in the following frequency distribution.

Category      Observed Frequency
Company A     48
Company B     98
Company C     54
Total         200


We now can perform a goodness of fit test to determine whether the sample of 
200 customer purchase preferences is consistent with the null hypothesis. The goodness 
of fit test is based on a comparison of observed frequencies from the sample with the 
expected frequencies under the assumption that the null hypothesis is true. Hence, 
the next step is to compute expected purchase preferences for the 200 customers under 
the assumption that $H_0$: $p_A = .30$, $p_B = .50$, and $p_C = .20$ is true. Doing so provides the
following expected frequency distribution. 


Category      Expected Frequency
Company A     200(.30) = 60
Company B     200(.50) = 100
Company C     200(.20) = 40
Total         200


Note that the expected frequency for each category is found by multiplying the sample size 
of 200 by the hypothesized probability for the category. 

The goodness of fit test now focuses on the differences between the observed frequen- 
cies and the expected frequencies. Whether the differences between the observed and 
expected frequencies are “large” or “small” is a question answered with the aid of the 
following chi-square test statistic. 


TEST STATISTIC FOR GOODNESS OF FIT

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \qquad (12.1)$$

where

$f_i$ = observed frequency for category $i$
$e_i$ = expected frequency for category $i$
$k$ = the number of categories

Note: The test statistic has a chi-square distribution with $k - 1$ degrees of freedom
provided that the expected frequencies are 5 or more for all categories.

In equation (12.1) the difference between the observed ($f_i$) and expected ($e_i$)
frequencies is squared; thus, $\chi^2$ will always be positive.


Let us continue with the Scott Marketing Research example and use the sample data
to test the hypothesis that the multinomial distribution has the market share probabilities
$p_A = .30$, $p_B = .50$, and $p_C = .20$. We will use an $\alpha = .05$ level of significance. We proceed
by using the observed and expected frequencies to compute the value of the test statistic.

TABLE 12.1 Computation of the Chi-Square Test Statistic for the Scott Marketing Research
Market Share Study

                                                                                Squared Difference
              Hypothesized   Observed    Expected                   Squared     Divided by
              Probability    Frequency   Frequency   Difference     Difference  Expected Frequency
Category                     ($f_i$)     ($e_i$)     ($f_i - e_i$)  ($(f_i - e_i)^2$)  ($(f_i - e_i)^2/e_i$)
Company A     .30            48          60          -12            144         2.40
Company B     .50            98          100         -2             4           0.04
Company C     .20            54          40          14             196         4.90
Total                        200                                                $\chi^2 = 7.34$


With the expected frequencies all 5 or more, the computation of the chi-square test statistic
is shown in Table 12.1. Thus, we have $\chi^2 = 7.34$.

The test for goodness of fit is always a one-tailed test with the rejection occurring in the
upper tail of the chi-square distribution.

We will reject the null hypothesis if the differences between the observed and expected
frequencies are large. Thus the test of goodness of fit will always be an upper tail test.
We can use the upper tail area for the test statistic and the p-value approach to determine
whether the null hypothesis can be rejected. With $k - 1 = 3 - 1 = 2$ degrees of freedom,
row two of the chi-square distribution table in Table 3 of Appendix B provides the following:

Area in Upper Tail       .10      .05      .025     .01      .005
$\chi^2$ Value (2 df)    4.605    5.991    7.378    9.210    10.597

$\chi^2 = 7.34$

The test statistic $\chi^2 = 7.34$ is between 5.991 and 7.378. Thus, the corresponding upper tail
area or p-value must be between .05 and .025. With p-value $\le .05$, we reject $H_0$ and conclude
that the introduction of the new product by company C will alter the historical market
shares. In the Using Excel subsection that follows we will see that the p-value = .0255.

Instead of using the p-value, we could use the critical value approach to draw the same
conclusion. With $\alpha = .05$ and 2 degrees of freedom, the critical value for the test statistic is
$\chi^2_{.05} = 5.991$. The upper tail rejection rule becomes

Reject $H_0$ if $\chi^2 \ge 5.991$

With 7.34 > 5.991, we reject $H_0$. The p-value approach and critical value approach provide
the same hypothesis testing conclusion.
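
The arithmetic of this example can also be checked outside of Excel. The following is a minimal Python sketch, not part of the text's Excel workflow; the variable names are our own. It relies on the fact that, for exactly 2 degrees of freedom, the chi-square upper tail area has the closed form $e^{-\chi^2/2}$, so the critical value for a given $\alpha$ is $-2\ln\alpha$. This shortcut does not apply to other degrees of freedom.

```python
import math

# Observed counts and hypothesized market shares from the Scott Marketing Research example
observed = {"A": 48, "B": 98, "C": 54}
hypothesized = {"A": 0.30, "B": 0.50, "C": 0.20}

n = sum(observed.values())                                # sample size, 200
expected = {c: n * p for c, p in hypothesized.items()}    # expected frequency = n * p

# Chi-square statistic: sum of (f_i - e_i)^2 / e_i over the categories
chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

df = len(observed) - 1                                    # k - 1 = 2 degrees of freedom

# Closed forms valid ONLY for df = 2 (an assumption of this sketch)
p_value = math.exp(-chi_sq / 2)
alpha = 0.05
critical_value = -2 * math.log(alpha)

print(f"chi-square = {chi_sq:.2f}, p-value = {p_value:.4f}, critical value = {critical_value:.3f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```

Both approaches lead to the same decision here: the statistic exceeds the critical value and the p-value is below $\alpha$.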

Now that we have concluded that the introduction of a new company C product will 
alter the market shares for the three companies, we are interested in knowing more about 
how the market shares are likely to change. Using the historical market shares and the 
sample data, we summarize the data as follows: 


Company    Historical Market Share (%)    Sample Data Market Share (%)
A          30                             48/200 = .24, or 24
B          50                             98/200 = .49, or 49
C          20                             54/200 = .27, or 27

FIGURE 12.1 Bar Chart of Market Shares by Company Before and After the New Product for Company C
[Bar chart; legend: Historical Market Share, After New Product C]

The historical market shares and the sample market shares are compared in the bar
chart shown in Figure 12.1. The bar chart shows that the new product will likely 
increase the market share for company C. Comparisons for the other two compa- 
nies indicate that company C’s gain in market share will hurt company A more 
than company B. 

Let us summarize the steps that can be used to conduct a goodness of fit test for a 
hypothesized multinomial probability distribution. 


MULTINOMIAL PROBABILITY DISTRIBUTION GOODNESS OF FIT TEST

1. State the null and alternative hypotheses.

$H_0$: The population follows a multinomial probability distribution with specified
probabilities for each of the $k$ categories
$H_a$: The population does not follow a multinomial probability distribution with
the specified probabilities for each of the $k$ categories

2. Select a random sample and record the observed frequencies $f_i$ for each category.

3. Assume the null hypothesis is true and determine the expected frequency $e_i$
in each category by multiplying the hypothesized category probability by the
sample size.

4. If the expected frequency $e_i$ is at least 5 for each category, compute the value of
the test statistic.

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

5. Rejection rule:
p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha}$

where $\alpha$ is the level of significance for the test and there are $k - 1$ degrees of
freedom.

FIGURE 12.2 Excel Worksheet for the Goodness of Fit Test for the Scott Marketing
Research Market Share Study

Cells D3:E7 (observed frequencies)
Categories    Obs. Frequency
A             48
B             98
C             54
Total         200

Cells D9:E12 (expected frequencies)
Hyp. Probability    Exp. Frequency (formula)    Exp. Frequency (value)
0.3                 =D10*$E$7                   60
0.5                 =D11*$E$7                   100
0.2                 =D12*$E$7                   40

Cell E14 (p-value): =CHISQ.TEST(E4:E6,E10:E12), which returns 0.0255

Note: Rows 16-199 are hidden.


Using Excel to Conduct a Goodness of Fit Test 


Excel can be used to conduct a goodness of fit test for the market share study conducted 
by Scott Marketing Research. Refer to Figure 12.2 as we describe the tasks involved. The 
formula worksheet is in the background; the value worksheet is in the foreground. 


Enter/Access Data: Open the file Research. The data are in cells B2:B201 and labels are
in column A and cell B1. Values for the hypothesized category probabilities were entered
into cells D10:D12.


Apply Tools: Cells D3:E7 show the results of using Excel’s Recommended PivotTable tool 
(see Section 2.1 for details regarding how to use this tool) to construct a frequency distribu- 
tion for the purchase preference data. 


Enter Functions and Formulas: The Excel formulas in cells E10:E12 were used to compute
the expected frequencies for each category by multiplying the hypothesized proportions
by the sample size. Once the observed and expected frequencies have been computed,
Excel’s CHISQ.TEST function can be used to compute the p-value for a test of goodness
of fit. The inputs to the CHISQ.TEST function are the range of values for the observed
and expected frequencies. To compute the p-value for this test, we entered the following
function into cell E14:

=CHISQ.TEST(E4:E6,E10:E12)

The value worksheet shows that the resulting p-value is .0255. Thus, with $\alpha = .05$, we
reject $H_0$ and conclude that the introduction of the new product by company C will change
the current market share structure.
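
For readers working outside Excel, the p-value can be reproduced with a short script. The sketch below is an illustration only and is not how Excel computes its result; the function names `chi_square_upper_tail` and `chisq_test` are hypothetical. It evaluates the chi-square upper tail area with the standard recurrence for the regularized upper incomplete gamma function, $Q(a + 1, y) = Q(a, y) + y^a e^{-y}/\Gamma(a + 1)$, starting from $Q(1, y) = e^{-y}$ for even degrees of freedom or $Q(1/2, y) = \operatorname{erfc}(\sqrt{y})$ for odd degrees of freedom.

```python
import math


def chi_square_upper_tail(x, df):
    """Upper tail area P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                  # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))       # Q(1/2, y) = erfc(sqrt(y))
    # Step a up to df/2 using Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chisq_test(observed, expected, df=None):
    """p-value for observed vs. expected frequencies; df defaults to k - 1 (goodness of fit)."""
    if df is None:
        df = len(observed) - 1
    statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return chi_square_upper_tail(statistic, df)


# Scott Marketing Research example: observed and expected frequencies
p_value = chisq_test([48, 98, 54], [60.0, 100.0, 40.0])
print(f"p-value = {p_value:.4f}")
```

The `df` argument is left adjustable because a test of independence uses $(r - 1)(c - 1)$ degrees of freedom rather than $k - 1$.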


Methods 
1. Test the following hypotheses for a multinomial probability distribution by using the
$\chi^2$ goodness of fit test.

$H_0$: $p_A = .40$, $p_B = .40$, and $p_C = .20$
$H_a$: The probabilities are not $p_A = .40$, $p_B = .40$, and $p_C = .20$

A sample of size 200 yielded 60 in category A, 120 in category B, and 20 in category C.
Use $\alpha = .01$ and test to see whether the probabilities are as stated in $H_0$.
a. Use the p-value approach.
b. Repeat the test using the critical value approach.

2. Suppose we have a multinomial population with four categories: A, B, C, and D. The
null hypothesis is that the proportion of items is the same in every category. The null
hypothesis is

$H_0$: $p_A = p_B = p_C = p_D = .25$

A sample of size 300 yielded the following results.

A: 85    B: 95    C: 50    D: 70

Use $\alpha = .05$ to determine whether $H_0$ should be rejected. What is the p-value?


Applications 
3. Television Audiences Across Networks. During the first 13 weeks of the television sea- 
son, the Saturday evening 8:00 P.M. to 9:00 P.M. audience proportions were recorded as 
ABC 29%, CBS 28%, NBC 25%, and Independents 18%. A sample of 300 homes two 
weeks after a Saturday night schedule revision yielded the following viewing audience 
data: ABC 95 homes, CBS 70 homes, NBC 89 homes, and Independents 46 homes. Test 
with $\alpha = .05$ to determine whether the viewing audience proportions changed.
4. M&M Candy Colors. Mars, Inc. manufactures M&M’s, one of the most popular
candy treats in the world. The milk chocolate candies come in a variety of colors
including blue, brown, green, orange, red, and yellow (M&M website). The overall
proportions for the colors are .24 blue, .13 brown, .20 green, .16 orange, .13 red, and
.14 yellow. In a sampling study, several bags of M&M milk chocolates were opened
and the following color counts were obtained.

DATA file: M&M

Blue    Brown    Green    Orange    Red    Yellow
105     72       89       84        70     80

Use a .05 level of significance and the sample data to test the hypothesis that the
overall proportions for the colors are as stated above. What is your conclusion?

5. America’s Favorite Sports. The Harris Poll tracks the favorite sport of Americans
who follow at least one sport. Results of the poll show that professional football is the
favorite sport of 33% of Americans who follow at least one sport, followed by baseball
at 15%, men’s college football at 10%, auto racing at 6%, men’s professional basketball
at 5%, and ice hockey at 5%, with other sports at 26%. Consider a survey in which
344 college undergraduates who follow at least one sport were asked to identify their
favorite sport. The survey produced the following results:

Favorite Sport                   Frequency
Professional Football            111
Baseball                         39
Men's College Football           46
Auto Racing                      14
Men's Professional Basketball    6
Ice Hockey                       20
Other Sports                     108

Do college undergraduate students differ from the general public with regard to their
favorite sports? Use $\alpha = .05$.


6. Traffic Accidents by Day of Week. The National Highway Traffic Safety Admin- 
istration reports the percentage of traffic accidents occurring each day of the week. 
Assume that a sample of 420 accidents provided the following data.

Sunday    Monday    Tuesday    Wednesday    Thursday    Friday    Saturday
66        50        53         47           55          69        80

a. Conduct a hypothesis test to determine whether the proportion of traffic accidents is
the same for each day of the week. What is the p-value? Using a .05 level of
significance, what is your conclusion?

b. Compute the percentage of traffic accidents occurring on each day of the week. 
What day has the highest percentage of traffic accidents? Does this seem reason- 
able? Discuss. 


12.2 Test of Independence 


In this section we show how the chi-square test of independence can be used to determine 
whether two categorical variables sampled from one population are independent. Like the 
goodness of fit test, the test of independence is also based on comparing observed and 
expected frequencies. For this test we take one sample from a single population and record 
the values for two categorical variables. We then summarize the data by counting the 
number of responses for each combination of a category for variable 1 and a category for 
variable 2. The null hypothesis for this test is that the two categorical variables are inde- 
pendent. Thus, the test is referred to as a test of independence. We will illustrate this test 
with the following example. 

A beer industry association conducts a survey to determine the preferences of beer 
drinkers for light, regular, and dark beers. A sample of 200 beer drinkers is taken with each 
person in the sample asked to indicate a preference for one of the three types of beers: 
light, regular, or dark. At the end of the survey questionnaire, the respondent is asked to 
provide information on a variety of demographics including gender: male or female. A 
research question of interest to the association is whether preference for the three types of 
beer is independent of the gender of the beer drinker. If the two categorical variables, beer 
preference and gender, are independent, beer preference does not depend on gender and 
the preference for light, regular, and dark beer can be expected to be the same for male and 
female beer drinkers. However, if the test conclusion is that the two categorical variables 
are not independent, we have evidence that beer preference is associated with or dependent 
upon the gender of the beer drinker. As a result, we can expect beer preferences to differ 
for male and female beer drinkers. In this case, a beer manufacturer could use this infor- 
mation to customize its promotions and advertising for the different target markets of male 
and female beer drinkers. 

The hypotheses for this test of independence are as follows: 


$H_0$: Beer preference is independent of gender
$H_a$: Beer preference is not independent of gender


The sample data will be summarized in a two-way table with beer preferences of light, regular,
and dark as one of the variables and gender of male and female as the other variable. Since an
objective of the study is to determine whether there is a difference between the beer preferences
for male and female beer drinkers, we consider gender an explanatory variable and follow the
usual practice of making the explanatory variable the column variable in the crosstabulation.
The beer preference is the categorical response variable and is shown as the row variable. The
sample results for the 200 beer drinkers in the study are summarized in Table 12.2.

The two-way table used to summarize the data is also referred to as a contingency table.

The sample data are summarized based on the combination of beer preference and gender for 
the individual respondents. For example, 51 individuals in the study were males who preferred 
light beer, 56 individuals in the study were males who preferred regular beer, and so on. Let us 
now analyze the data in the table and test for independence of beer preference and gender. 

Because we selected a sample of beer drinkers, summarizing the data for each variable 
separately will provide some insights into the characteristics of the beer drinker population.

TABLE 12.2 Sample Results for Beer Preferences of Male and Female Beer
Drinkers (Observed Frequencies)

DATA file: BeerPreference

                          Gender
Beer Preference    Male    Female    Total
Light              51      39        90
Regular            56      21        77
Dark               25      8         33
Total              132     68        200


For the categorical variable gender, 132 of the 200 beer drinkers in the sample are male. 
Thus, we estimate that 132/200 = .66, or 66%, of the beer drinker population is male. Sim- 
ilarly we estimate that 68/200 = .34, or 34%, of the beer drinker population is female. The 
sample data suggest that male beer drinkers outnumber female beer drinkers approximately 
2 to 1. Sample proportions or percentages for the three types of beer are 


Prefer Light Beer 90/200 = .450, or 45.0% 
Prefer Regular Beer 77/200 = .385, or 38.5% 
Prefer Dark Beer 33/200 = .165, or 16.5% 


Across all beer drinkers in the sample, light beer is preferred most often and dark beer is 
preferred least often. 

Let us now conduct the chi-square test to determine whether beer preference and gender 
are independent. The data in Table 12.2 are the observed frequencies for the two categories 
of gender and the three categories of beer preference. 

A table of expected frequencies is developed based on the assumption that gender and 
beer preference are independent as stated in the null hypothesis. We showed above that for 
the sample of 200 beer drinkers, the proportions preferring light, regular, and dark beer are 
.450, .385, and .165, respectively. If the independence assumption is valid, these proportions
must apply to both male and female beer drinkers. Thus, under the assumption of independence,
we would expect that, of the 132 male beer drinkers sampled, .450(132) = 59.40 would prefer
light beer, .385(132) = 50.82 would prefer regular beer, and .165(132) = 21.78 would prefer
dark beer. Application of the same proportions to the female beer drinkers leads to the
expected frequencies shown in Table 12.3.

Let us generalize the approach to computing expected frequencies by letting $e_{ij}$ denote
the expected frequency in row $i$ and column $j$ of the table of expected frequencies. With
this notation, let us reconsider the expected frequency calculation for light beer (row 1) and
male beer drinkers (column 1), that is, the expected frequency $e_{11}$.


TABLE 12.3 Expected Frequencies If Beer Preference Is Independent
of the Gender of the Beer Drinker

                          Gender
Beer Preference    Male      Female    Total
Light              59.40     30.60     90
Regular            50.82     26.18     77
Dark               21.78     11.22     33
Total              132.00    68.00     200

Note that 90 is the total number of light beer responses (row 1 total), 132 is the total
number of male respondents in the sample (column 1 total), and 200 is the total sample
size. Following the logic of the preceding paragraphs, we can compute the expected
frequency in row 1 and column 1 as follows:

$$e_{11} = \left(\frac{\text{Row 1 Total}}{\text{Sample Size}}\right)(\text{Column 1 Total}) = \left(\frac{90}{200}\right)(132) = 59.40$$

Rewriting this expression slightly we obtain

$$e_{11} = \frac{(\text{Row 1 Total})(\text{Column 1 Total})}{\text{Sample Size}} = \frac{(90)(132)}{200} = 59.40$$


Generalizing this expression shows that the following formula can be used to compute
the expected frequencies under the assumption that $H_0$ is true.

$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample Size}} \qquad (12.2)$$

For example, $e_{11} = (90)(132)/200 = 59.40$ is the expected frequency for male beer drinkers
who would prefer light beer if beer preference is independent of gender, $e_{12} = (90)(68)/200 =
30.60$, and so on. Show that equation (12.2) can be used to find the other expected frequencies
shown in Table 12.3.
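
As a check on equation (12.2), the full table of expected frequencies can be generated from the row and column totals of the observed table. The following Python sketch is an illustration with variable names of our own choosing; it simply applies (Row i Total)(Column j Total)/Sample Size to every cell.

```python
# Observed frequencies from Table 12.2: rows are Light, Regular, Dark; columns are Male, Female
observed = [
    [51, 39],
    [56, 21],
    [25, 8],
]

row_totals = [sum(row) for row in observed]           # 90, 77, 33
col_totals = [sum(col) for col in zip(*observed)]     # 132, 68
sample_size = sum(row_totals)                         # 200

# Equation (12.2): e_ij = (row i total)(column j total) / sample size
expected = [[r * c / sample_size for c in col_totals] for r in row_totals]

for label, row in zip(["Light", "Regular", "Dark"], expected):
    print(label, [round(e, 2) for e in row])
```

Each row of expected frequencies sums to its observed row total, and each column to its observed column total, which is a quick way to catch arithmetic slips.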

Using the table of observed frequencies (Table 12.2) and the table of expected frequencies
(Table 12.3), we now wish to compute the chi-square statistic for our test of independence.
Since the test of independence involves $r$ rows and $c$ columns, the formula for
computing $\chi^2$ involves a double summation.

$$\chi^2 = \sum_{i} \sum_{j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} \qquad (12.3)$$


With r rows and c columns in the table, the chi-square distribution will have (r — 1)(c — 1) 
degrees of freedom provided the expected frequency is at least 5 for each cell. Thus, in this 
application we will use a chi-square distribution with (3 — 1)(2 — 1) = 2 degrees of freedom. 
The complete steps to compute the chi-square test statistic are summarized in Table 12.4. 

TABLE 12.4 Computation of the Chi-Square Test Statistic for the Test of Independence
Between Beer Preference and Gender

                                                                                         Squared Difference
                        Observed    Expected                          Squared            Divided by
Beer                    Frequency   Frequency   Difference            Difference         Expected Frequency
Preference   Gender     ($f_{ij}$)  ($e_{ij}$)  ($f_{ij} - e_{ij}$)   ($(f_{ij} - e_{ij})^2$)   ($(f_{ij} - e_{ij})^2/e_{ij}$)
Light        Male       51          59.40       -8.40                 70.56              1.19
Light        Female     39          30.60       8.40                  70.56              2.31
Regular      Male       56          50.82       5.18                  26.83              .53
Regular      Female     21          26.18       -5.18                 26.83              1.02
Dark         Male       25          21.78       3.22                  10.37              .48
Dark         Female     8           11.22       -3.22                 10.37              .92
Total                   200         200.00                                               $\chi^2 = 6.45$

We can use the upper tail area of the chi-square distribution with 2 degrees of freedom
and the p-value approach to determine whether the null hypothesis that beer preference is
independent of gender can be rejected. Using row two of the chi-square distribution table
shown in Table 3 of Appendix B, we have the following:

Area in Upper Tail       .10      .05      .025     .01      .005
$\chi^2$ Value (2 df)    4.605    5.991    7.378    9.210    10.597

$\chi^2 = 6.45$

Because $\chi^2 = 6.45$ lies between 5.991 and 7.378, the corresponding upper tail area or
p-value must be between .05 and .025. With p-value $\le .05$, we reject
$H_0$ and conclude that beer preference is not independent of the gender of the beer drinker.
Stated another way, the study shows that beer preference can be expected to differ for male
and female beer drinkers. In the Using Excel subsection that follows, we will see that the
p-value = .0398.

Instead of using the p-value, we could use the critical value approach to draw the same
conclusion. With $\alpha = .05$ and 2 degrees of freedom, the critical value for the chi-square
test statistic is $\chi^2_{.05} = 5.991$. The upper tail rejection region becomes

Reject $H_0$ if $\chi^2 \ge 5.991$

With $6.45 \ge 5.991$, we reject $H_0$. Again we see that the p-value approach and the critical
value approach provide the same conclusion.
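
The double summation in equation (12.3) can be traced in a few lines of code. The Python sketch below is an illustration under the same assumption used earlier in this chapter's examples: because this table has $(3 - 1)(2 - 1) = 2$ degrees of freedom, the upper tail area can be written as $e^{-\chi^2/2}$. That closed form does not hold for other degrees of freedom, and the variable names are our own.

```python
import math

# Observed frequencies from Table 12.2: rows are Light, Regular, Dark; columns are Male, Female
observed = [
    [51, 39],
    [56, 21],
    [25, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence, equation (12.2)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Equation (12.3): sum over all cells of (f_ij - e_ij)^2 / e_ij
chi_sq = sum(
    (f - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for f, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r - 1)(c - 1) = 2
p_value = math.exp(-chi_sq / 2)                      # closed form valid only when df = 2

print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The result agrees with Table 12.4 and with the p-value reported from Excel.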

While we now have evidence that beer preference and gender are not independent, we 
will need to gain additional insight from the data to assess the nature of the association 
between these two variables. One way to do this is to compute the probability of the beer 
preference responses for males and females separately. These calculations are as follows: 


Beer Preference    Male                          Female
Light              51/132 = .3864, or 38.64%     39/68 = .5735, or 57.35%
Regular            56/132 = .4242, or 42.42%     21/68 = .3088, or 30.88%
Dark               25/132 = .1894, or 18.94%     8/68 = .1176, or 11.76%

A bar chart showing the beer preference for male and female beer drinkers is shown in
Figure 12.3.

FIGURE 12.3 Bar Chart Comparison of Beer Preference by Gender

The expected frequencies must all be 5 or more for the chi-square test to be valid.

This chi-square test is a one-tailed test with rejection of $H_0$ occurring in the upper tail
of a chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom.

What observations can you make about the association between beer preference and
gender? For female beer drinkers in the sample, the highest preference is for light beer at 
57.35%. For male beer drinkers in the sample, regular beer is most frequently preferred 
at 42.42%. While female beer drinkers have a higher preference for light beer than males, 
male beer drinkers have a higher preference for both regular beer and dark beer. Data visu- 
alization through bar charts such as shown in Figure 12.3 is helpful in gaining insight as to 
how two categorical variables are associated. 
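
The column percentages used in this comparison are each cell count divided by its column (gender) total. A minimal Python sketch of that calculation follows; the dictionary layout and variable names are our own and are only one of many reasonable ways to hold the data.

```python
# Observed frequencies from Table 12.2, keyed by beer preference and then gender
observed = {
    "Light":   {"Male": 51, "Female": 39},
    "Regular": {"Male": 56, "Female": 21},
    "Dark":    {"Male": 25, "Female": 8},
}
genders = ["Male", "Female"]

# Column totals: number of male and female beer drinkers in the sample
col_totals = {g: sum(row[g] for row in observed.values()) for g in genders}

# Percentage of each gender preferring each type of beer
percent = {
    beer: {g: 100 * row[g] / col_totals[g] for g in genders}
    for beer, row in observed.items()
}

for beer, row in percent.items():
    print(beer, {g: round(p, 2) for g, p in row.items()})
```

Because each column is divided by its own total, the percentages within a gender sum to 100, which makes the two groups directly comparable even though the sample contains about twice as many males as females.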

Before we leave this discussion, we summarize the steps for a test of independence. 


CHI-SQUARE TEST FOR INDEPENDENCE OF TWO CATEGORICAL VARIABLES

1. State the null and alternative hypotheses.

$H_0$: The two categorical variables are independent
$H_a$: The two categorical variables are not independent

2. Select a random sample from the population and collect data for both variables
for every element in the sample. Record the observed frequencies, $f_{ij}$, in a table
with $r$ rows and $c$ columns.

3. Assume the null hypothesis is true and compute the expected frequencies, $e_{ij}$.

4. If the expected frequency, $e_{ij}$, is 5 or more for each cell, compute the test statistic:

$$\chi^2 = \sum_{i} \sum_{j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

5. Rejection rule:

p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha}$

where the chi-square distribution has $(r - 1)(c - 1)$ degrees of freedom and $\alpha$ is
the level of significance for the test.


Finally, if the null hypothesis of independence is rejected, summarizing the probabilities 
as shown in the above example will help the analyst determine where the association or 
dependence exists for the two categorical variables. 


Using Excel to Conduct a Test of Independence 


Excel can be used to conduct a test of independence for the beer preference example. Refer 
to Figure 12.4 as we describe the tasks involved. The formula worksheet is in the back- 
ground; the value worksheet is in the foreground. 


Enter/Access Data: Open the file BeerPreference. The data are in cells B2:C201 and 
labels are in column A and cells B1:C1. 


Apply Tools: Cells E3:H8 show the contingency table resulting from using Excel’s 
PivotTable tool (see Section 2.4 for details regarding how to use this tool) to construct a 
two-way table with beer preferences of light, regular, and dark as one of the variables and 
gender of male and female as the other variable. 


FIGURE 12.4 Excel Worksheet for the Beer Preference Test of Independence

Enter Functions and Formulas: The Excel formulas in cells F12:G14 were used to compute
the expected frequencies for each row and column. Once the observed and expected
frequencies have been computed, Excel’s CHISQ.TEST function can be used to compute
the p-value for a test of independence. The inputs to the CHISQ.TEST function are the
range of values for the observed and expected frequencies. To compute the p-value for this
test of independence, we entered the following function into cell G16:


=CHISQ.TEST(F5:G7,F12:G14) 


The value worksheet shows that the resulting p-value is .0398. Thus, with α = .05, we reject
$H_0$ and conclude that beer preference is not independent of the gender of the beer drinker.
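As a cross-check outside Excel, the same p-value can be approximated with a short Python sketch. Two assumptions are worth stating plainly. First, the counts below are assumed: they are chosen to be consistent with the 200 observations, the percentages quoted earlier (57.35% of females preferring light beer, 42.42% of males preferring regular beer), and the p-value of .0398, but the contingency table itself is not reproduced here. Second, the helper `chisq_test_2df` is a hypothetical name and relies on the fact that, with exactly 2 degrees of freedom, the upper tail area of the chi-square distribution at x equals exp(-x/2); it is not a general replacement for CHISQ.TEST.

```python
import math


def chisq_test_2df(observed, expected):
    """Return (statistic, p-value) for a chi-square test with 2 degrees of freedom.

    With 2 degrees of freedom the upper tail area at x is exp(-x / 2),
    so no statistics library is needed. Other table sizes need other methods.
    """
    statistic = sum(
        (f - e) ** 2 / e
        for f_row, e_row in zip(observed, expected)
        for f, e in zip(f_row, e_row)
    )
    return statistic, math.exp(-statistic / 2)


# ASSUMED observed counts (rows: light, regular, dark; columns: male, female).
observed = [[51, 39],
            [56, 21],
            [25, 8]]

# Expected frequencies under independence: (row total)(column total) / n.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

statistic, p_value = chisq_test_2df(observed, expected)
print(round(statistic, 2), round(p_value, 4))
```

Like CHISQ.TEST, the sketch takes the observed and expected tables as its two inputs; the 3 × 2 layout gives (3 - 1)(2 - 1) = 2 degrees of freedom, which is what makes the closed-form tail area applicable.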


NOTES + COMMENTS 


The test statistic for the chi-square tests in this chapter requires an expected frequency of 5
for each category. When a category has fewer than five, it is often appropriate to combine
two adjacent categories to obtain an expected frequency of 5 or more in each category.


EXERCISES


Methods 
7. The following table contains observed frequencies for a sample of 200. Test for
independence of the row and column variables using α = .05.


Column Variable 


Row Variable A B C 
P 20 44 50 
Q 30 26 30 


8. The following table contains observed frequencies for a sample of 240. Test for
independence of the row and column variables using α = .05.


Column Variable


Row Variable A B C
P 20 30 20
Q 30 60 25
R 10 15 30

Applications
9. Airline Ticket Purchases for Domestic and International Flights. A Bloomberg 
Businessweek subscriber study asked, “In the past 12 months, when traveling for 
business, what type of airline ticket did you purchase most often?” A second question 
asked if the type of flight was domestic or international travel. Sample data obtained 
are shown in the following table. 


Type of Flight 


Type of Ticket Domestic International 
First Class 29 22 
Business Class 95 121
Economy Class 518 135 


a. Using a .05 level of significance, is the type of ticket purchased independent of the 
type of flight? What is your conclusion? 
b. Discuss any dependence that exists between the type of ticket and type of flight. 
10. Hiring and Firing Plans at Private and Public Companies. A Deloitte employment
survey asked a sample of human resource executives how their company planned
to change its workforce over the next 12 months. A categorical response variable
showed three options: The company plans to hire and add to the number of employees,
the company plans no change in the number of employees, or the company plans
to lay off and reduce the number of employees. Another categorical variable indicated
if the company was private or public. Sample data for 180 companies are summarized
as follows. (DATA file: WorkforcePlan)


Company
Employment Plan Private Public
Add Employees 37 32
No Change 19 34
Lay Off Employees 16 42

a. Conduct a test of independence to determine whether the employment plan for the 
next 12 months is independent of the type of company. At a .05 level of signifi- 
cance, what is your conclusion? 

b. Discuss any differences in the employment plans for private and public companies 
over the next 12 months. 

11. Generational Differences in Work Place Attitudes. In 2015, Addison Group and
Kelton surveyed the work preferences and attitudes of 1006 working adults spread
over three generations: Baby Boomers, Generation X, and Millennials (Society for
Human Resource Management website). One question asked individuals if they would
leave their current job to make more money at another job. The file Millennials contains
the sample data, which are also summarized in the following table.


Generation
Leave Job for
More Money? Baby Boomer Generation X Millennial
Yes 129 152 164
No 207 183 171


Conduct a test of independence to determine whether interest in leaving a current job
for more money is independent of employee generation. What is the p-value? Using a
.05 level of significance, what is your conclusion?
12. Vehicle Quality Ratings. A J.D. Power and Associates vehicle quality survey asked
new owners a variety of questions about their recently purchased automobile. One
question asked for the owner’s rating of the vehicle using categorical responses of
average, outstanding, and exceptional. Another question asked for the owner’s education
level with the categorical responses some high school, high school graduate, some
college, and college graduate. Assume the sample data below are for 500 owners who
had recently purchased an automobile. (DATA file: AutoQuality)


Education
Quality Rating Some HS HS Grad Some College College Grad
Average 35 30 20 60
Outstanding 45 45 50 90
Exceptional 20 25 30 50


a. Use a .05 level of significance and a test of independence to determine whether a 
new owner’s vehicle quality rating is independent of the owner’s education. What is 
the p-value and what is your conclusion? 

b. Use the overall percentage of average, outstanding, and exceptional ratings 
to comment on how new owners rate the quality of their recently purchased 
automobiles. 

13. Company Reputation and Management Quality Survey. The Wall Street Journal 
Annual Corporate Perceptions Study surveys readers and asks how each rates the 
quality of management and the reputation of the company for over 250 worldwide 
corporations. Both the quality of management and the reputation of the company were 
rated on an excellent, good, and fair categorical scale. Assume the sample data for 
200 respondents below applies to this study. 


Reputation of Company 


Quality of Management Excellent Good Fair 
Excellent 40 25 5 
Good 35 35 10 
Fair 25 10 15 


a. Use a .05 level of significance and test for independence of the quality of man- 
agement and the reputation of the company. What is the p-value and what is your 
conclusion? 

b. If there is a dependence or association between the two ratings, discuss and use 
probabilities to justify your answer. 

14. Attitudes Toward New Nuclear Power Plants. As the price of oil rises, there is
increased worldwide interest in alternate sources of energy. The Financial Times/
Harris Poll surveyed people in six countries to assess attitudes toward a variety of
alternate forms of energy. The data in the following table are a portion of the poll’s
findings concerning whether people favor or oppose the building of new nuclear
power plants.


Country

Response Great Britain France Italy Spain Germany United States
Strongly favor 141 161 298 133 128 204 
Favor more than oppose 348 366 309 222 PU 326 
Oppose more than favor 381 334 219. Sit 622) 316 
Strongly oppose 217 215 AS) 443 389 174 


a. How large was the sample in this poll?

b. Conduct a hypothesis test to determine whether people’s attitude toward building
new nuclear power plants is independent of country. What is your conclusion?

c. Using the percentage of respondents who “strongly favor” and “favor more than
oppose,” which country has the most favorable attitude toward building new nuclear
power plants? Which country has the least favorable attitude?

15. Research Classification of Higher Education. The Carnegie Classification of Institutes
of Higher Education categorizes colleges and universities on the basis of their research
and degree-granting activities. Universities that grant doctoral degrees are placed into
one of three classifications: highest research activity, higher research activity, or moderate
research activity. The Carnegie classifications for public and not-for-profit private
doctoral degree-granting universities follow.


Carnegie Classification


Type of University        Highest Research Activity   Higher Research Activity   Moderate Research Activity
Public                    81                          76                         38
Not-For-Profit Private    34                          31                         58



Using a .05 level of significance, conduct a test of independence to determine whether 
Carnegie classification is independent of type of university for universities that grant 
doctoral degrees. What is the p-value and what is your conclusion? 

16. Movie Critic Opinions. On a television program, two movie critics provide their
reviews of recent movies and discuss. It is suspected that these hosts deliberately dis- 
agree in order to make the program more interesting for viewers. Each movie review 
is categorized as Pro (“thumbs up”), Con (“thumbs down’), or Mixed. The results of 
160 movie ratings by the two hosts are shown here. 


Host B 
Host A Con Mixed Pro 
Con 24 8 13
Mixed 8 13 11
Pro 10 9 64 


Use a test of independence with a .01 level of significance to analyze the data. What is
your conclusion?


12.3 Testing for Equality of Three or More Population Proportions


In this section, we show how the chi-square ($\chi^2$) test statistic can be used to make
statistical inferences about the equality of population proportions for three or more
populations. The methodology is similar to that for the test of independence in Section 12.2
in the sense that we will compare a table of observed frequencies to a table of expected
frequencies.

Note: We use the chi-square test statistic in a manner similar to the way we have used
the normal (z) test statistic, t test statistic, and the F test statistic for hypothesis testing.

Using the notation

$p_1$ = population proportion for population 1
$p_2$ = population proportion for population 2
and
$p_k$ = population proportion for population k

the hypotheses for the equality of population proportions for k ≥ 3 populations are as
follows:

$H_0$: $p_1 = p_2 = \cdots = p_k$
$H_a$: Not all population proportions are equal

If the sample data and the chi-square test computations indicate $H_0$ cannot be rejected,
we cannot detect a difference among the k population proportions. However, if the
sample data and the chi-square test computations indicate $H_0$ can be rejected, we have
statistical evidence to conclude that not all k population proportions are equal; that
is, one or more population proportions differ from the other population proportions.
Further analyses can be conducted to determine which population proportion or proportions
are significantly different from others. Let us demonstrate this chi-square test by
considering an application.

Organizations such as J.D. Power and Associates use the proportion of owners likely to 
repurchase a particular automobile as an indication of customer loyalty for the automobile. 
An automobile with a greater proportion of owners likely to repurchase is considered to 
have greater customer loyalty. Suppose that in a particular study we want to compare the 
customer loyalty for three automobiles: Chevrolet Impala, Ford Fusion, and Honda Accord. 
The current owners of each of the three automobiles form the three populations for the 
study. The three population proportions of interest are as follows: 


$p_1$ = proportion likely to repurchase for the population of Chevrolet Impala owners
$p_2$ = proportion likely to repurchase for the population of Ford Fusion owners
$p_3$ = proportion likely to repurchase for the population of Honda Accord owners

The hypotheses are stated as follows:

$H_0$: $p_1 = p_2 = p_3$
$H_a$: Not all population proportions are equal


To conduct this hypothesis test we begin by taking a sample of owners from each of
the three populations. Thus we will have a sample of Chevrolet Impala owners, a sample
of Ford Fusion owners, and a sample of Honda Accord owners. Each sample provides
categorical data indicating whether the respondents are likely or not likely to repurchase
the automobile. The data for samples of 125 Chevrolet Impala owners, 200 Ford Fusion
owners, and 175 Honda Accord owners are summarized in Table 12.5. This table has two
rows for the responses Yes and No and three columns, one for each of the populations
of automobile owners. The observed frequencies are summarized in the six cells of the
table corresponding to each combination of the likely to repurchase responses and the
three populations.

Note: In studies such as these, we often use the same sample size for each population.
We have chosen different sample sizes in this example to show that the chi-square test
is not restricted to equal sample sizes for each of the k populations.




TABLE 12.5 Sample Results of Likely to Repurchase for Three Populations
of Automobile Owners (Observed Frequencies)

(DATA file: AutoLoyalty)

                         Automobile Owners
Likely to Repurchase     Chevrolet Impala   Ford Fusion   Honda Accord   Total
Yes                      69                 120           123            312
No                       56                 80            52             188
Total                    125                200           175            500


Using Table 12.5, we see that 69 of the 125 Chevrolet Impala owners said they were 
likely to repurchase a Chevrolet Impala; 120 of the 200 Ford Fusion owners said they 
were likely to repurchase a Ford Fusion; and 123 of the 175 Honda Accord owners said 
they were likely to repurchase a Honda Accord. Also, across all three samples, 312 of the 
500 owners in the study indicated that they were likely to repurchase their current auto- 
mobile. The question now is how to analyze the data in Table 12.5 to determine whether 
the hypothesis $H_0$: $p_1 = p_2 = p_3$ should be rejected.

The data in Table 12.5 are the observed frequencies for each of the six cells that rep- 
resent the six combinations of the likely to repurchase response and the owner population. 
If we can determine the expected frequencies under the assumption $H_0$ is true, we can use
a chi-square test statistic to determine whether there is a significant difference between the
observed and expected frequencies. If a significant difference exists between the observed
and expected frequencies, the hypothesis $H_0$ can be rejected and there is evidence that not
all the population proportions are equal. 

Expected frequencies for the six cells of the table are based on the following rationale. 
First, we assume that the null hypothesis of equal population proportions is true. Then we 
note that the three samples include a total of 500 owners; for this group, 312 owners indi- 
cated that they were likely to repurchase their current automobile. Thus, 312/500 = .624 
is the overall proportion of owners indicating they are likely to repurchase their current 
automobile. If $H_0$: $p_1 = p_2 = p_3$ is true, .624 would be the best estimate of the proportion
responding likely to repurchase for each of the automobile owner populations. So if the
assumption of $H_0$ is true, we would expect .624 of the 125 Chevrolet Impala owners, or
.624(125) = 78 owners to indicate they are likely to repurchase the Impala. Using the
.624 overall proportion, we would expect .624(200) = 124.8 of the 200 Ford Fusion owners
and .624(175) = 109.2 of the Honda Accord owners to respond that they are likely to
repurchase their respective model of automobile.

The approach of computing a table of expected frequencies for this test of mul- 
tiple proportions is essentially the same as we used for expected frequencies in 
Section 12.2. The following formula can be used to provide the expected frequencies 
under the assumption that the null hypothesis concerning equality of the population 
proportions is true. 


EXPECTED FREQUENCIES UNDER THE ASSUMPTION $H_0$ IS TRUE

$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sum of Sample Sizes}} \qquad (12.4)$$


Using equation (12.4), we see that the expected frequency of Yes responses (row 1) for
Honda Accord owners (column 3) would be $e_{13}$ = (Row 1 Total)(Column 3 Total)/(Sum of
Sample Sizes) = (312)(175)/500 = 109.2. Use equation (12.4) to verify the other expected
frequencies are as shown in Table 12.6.

TABLE 12.6 Expected Frequencies for Likely to Repurchase for Three
Populations of Automobile Owners If $H_0$ Is True

                         Automobile Owners
Likely to Repurchase     Chevrolet Impala   Ford Fusion   Honda Accord   Total
Yes                      78                 124.8         109.2          312
No                       47                 75.2          65.8           188
Total                    125                200.0         175.0          500
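The expected frequencies in Table 12.6 can be reproduced directly from the observed frequencies in Table 12.5 using equation (12.4). The following Python sketch is only an illustration of that arithmetic; the variable names are our own and are not part of the Excel worksheet described later in this section.

```python
# Observed frequencies from Table 12.5.
# Rows: Yes, No. Columns: Chevrolet Impala, Ford Fusion, Honda Accord.
observed = [[69, 120, 123],
            [56, 80, 52]]

row_totals = [sum(row) for row in observed]           # Yes and No totals
sample_sizes = [sum(col) for col in zip(*observed)]   # one sample size per population
total = sum(sample_sizes)                             # sum of sample sizes

# Equation (12.4): e_ij = (Row i Total)(Column j Total) / (Sum of Sample Sizes)
expected = [[r * n_j / total for n_j in sample_sizes] for r in row_totals]

for label, row in zip(("Yes", "No"), expected):
    print(label, [round(e, 1) for e in row])

# The chi-square test statistic requires every expected frequency to be 5 or more.
all_at_least_five = all(e >= 5 for row in expected for e in row)
print(all_at_least_five)
```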


The test procedure for comparing the observed frequencies of Table 12.5 with the expec- 
ted frequencies of Table 12.6 involves the computation of the following chi-square statistic: 


CHI-SQUARE TEST STATISTIC

$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}} \qquad (12.5)$$

where

$f_{ij}$ = observed frequency for the cell in row i and column j

$e_{ij}$ = expected frequency for the cell in row i and column j under the assumption
$H_0$ is true

Note: In a chi-square test involving the equality of k population proportions, the above
test statistic has a chi-square distribution with k - 1 degrees of freedom provided the
expected frequency is 5 or more for each cell.


Reviewing the expected frequencies in Table 12.6, we see that the expected frequency 
is at least five for each cell in the table. We therefore proceed with the computation of 
the chi-square test statistic. The calculations necessary to compute the value of the test 
statistic are shown in Table 12.7. In this case, we see that the value of the test statistic is
$\chi^2$ = 7.89.


TABLE 12.7 Computation of the Chi-Square Test Statistic for the Test of Equal
Population Proportions

Column headings: Likely to Repurchase?; Automobile Owner; Observed Frequency ($f_{ij}$);
Expected Frequency ($e_{ij}$); Difference ($f_{ij} - e_{ij}$); Squared Difference
$(f_{ij} - e_{ij})^2$; Squared Difference Divided by Expected Frequency $(f_{ij} - e_{ij})^2/e_{ij}$

Likely to     Automobile   Observed    Expected                 Squared      Squared Difference /
Repurchase?   Owner        Frequency   Frequency   Difference   Difference   Expected Frequency
Yes           Impala       69          78.0        -9.0         81.00        1.04
Yes           Fusion       120         124.8       -4.8         23.04        0.18
Yes           Accord       123         109.2       13.8         190.44       1.74
No            Impala       56          47.0        9.0          81.00        1.72
No            Fusion       80          75.2        4.8          23.04        0.31
No            Accord       52          65.8        -13.8        190.44       2.89
Total                      500         500.0                                 $\chi^2$ = 7.89
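The cell-by-cell computation in Table 12.7 can also be written as a short Python sketch. It is offered only as an illustration of equation (12.5); the list names are our own, and the observed and expected frequencies are those of Tables 12.5 and 12.6, listed in the same row order as Table 12.7.

```python
# Cells in the order used by Table 12.7:
# Yes/Impala, Yes/Fusion, Yes/Accord, No/Impala, No/Fusion, No/Accord.
observed = [69, 120, 123, 56, 80, 52]
expected = [78.0, 124.8, 109.2, 47.0, 75.2, 65.8]

# Equation (12.5): each cell contributes (f - e)^2 / e to the statistic.
contributions = [(f - e) ** 2 / e for f, e in zip(observed, expected)]
chi_square = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(chi_square, 2))
```

Summing the unrounded contributions and rounding only at the end avoids the small discrepancy that arises if the two-decimal values in the last column are added by hand.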
Note: The chi-square test presented in this section is always a one-tailed test with the
rejection of $H_0$ occurring in the upper tail of the chi-square distribution.

We can use the upper tail area of the appropriate chi-square distribution and the p-value
approach to determine whether the null hypothesis can be rejected. In the automobile
brand loyalty study, the three owner populations indicate that the appropriate chi-square
distribution has k - 1 = 3 - 1 = 2 degrees of freedom. Using row two of the chi-square
distribution table, we have the following:


Area in Upper Tail       .10     .05     .025    .01     .005
$\chi^2$ Value (2 df)    4.605   5.991   7.378   9.210   10.597

($\chi^2$ = 7.89 lies between 7.378 and 9.210.)


We see the upper tail area at $\chi^2$ = 7.89 is between .025 and .01. Thus, the corresponding
upper tail area or p-value must be between .025 and .01. With p-value ≤ .05, we reject
$H_0$ and conclude that the three population proportions are not all equal and thus there is a
difference in brand loyalties among the Chevrolet Impala, Ford Fusion, and Honda Accord
owners. In the Using Excel subsection that follows, we will see that the p-value = .0193.
Instead of using the p-value, we could use the critical value approach to draw the same
conclusion. With α = .05 and 2 degrees of freedom, the critical value for the chi-square
test statistic is $\chi^2_\alpha$ = 5.991. The upper tail rejection region becomes


Reject $H_0$ if $\chi^2 \geq 5.991$


With 7.89 ≥ 5.991, we reject $H_0$. Thus, the p-value approach and the critical value
approach provide the same hypothesis-testing conclusion.
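Both rejection rules can be checked numerically for this example. The Python sketch below is an illustration under one important assumption: it uses closed-form expressions that hold only for a chi-square distribution with exactly 2 degrees of freedom, where the upper tail area at x is exp(-x/2) and the critical value for level α is -2 ln(α). The function names are hypothetical, and other degrees of freedom require a table or statistical software.

```python
import math


def upper_tail_area_2df(x):
    """Upper tail area of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-x / 2)


def critical_value_2df(alpha):
    """Chi-square value with upper tail area alpha, for 2 degrees of freedom."""
    return -2 * math.log(alpha)


# Observed (Table 12.5) and expected (Table 12.6) frequencies, cell by cell.
observed = [69, 120, 123, 56, 80, 52]
expected = [78.0, 124.8, 109.2, 47.0, 75.2, 65.8]
chi_square = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

alpha = 0.05
p_value = upper_tail_area_2df(chi_square)
critical_value = critical_value_2df(alpha)

reject_by_p_value = p_value <= alpha
reject_by_critical_value = chi_square >= critical_value

print(round(p_value, 4), round(critical_value, 3))
print(reject_by_p_value, reject_by_critical_value)

# The same closed form reproduces the 2 df row of the chi-square table.
print([round(critical_value_2df(a), 3) for a in (0.10, 0.05, 0.025, 0.01, 0.005)])
```

The p-value is computed from the unrounded test statistic; starting from the rounded value 7.89 would shift the fourth decimal place.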

Let us now summarize the general steps that can be used to conduct a chi-square test for 
the equality of the population proportions for three or more populations. 


A CHI-SQUARE TEST FOR THE EQUALITY OF POPULATION PROPORTIONS
FOR k ≥ 3 POPULATIONS

1. State the null and alternative hypotheses:

$H_0$: $p_1 = p_2 = \cdots = p_k$
$H_a$: Not all population proportions are equal

2. Select a random sample from each of the populations and record the observed
frequencies, $f_{ij}$, in a table with 2 rows and k columns.

3. Assume the null hypothesis is true and compute the expected frequencies, $e_{ij}$.

4. If the expected frequency, $e_{ij}$, is 5 or more for each cell, compute the test
statistic:

$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

5. Rejection rule:

p-value approach: Reject $H_0$ if p-value ≤ α
Critical value approach: Reject $H_0$ if $\chi^2 \geq \chi^2_\alpha$

where the chi-square distribution has k - 1 degrees of freedom and α is the
level of significance for the test.


A Multiple Comparison Procedure 


We have used a chi-square test to conclude that the population proportions for the three 
populations of automobile owners are not all equal. Thus, some differences among the 
population proportions exist and the study indicates that customer loyalties are not all the 
same for the Chevrolet Impala, Ford Fusion, and Honda Accord owners. To identify where
the differences between population proportions exist, we can begin by computing the three
sample proportions as follows:


Brand Loyalty Sample Proportions
Chevrolet Impala   $\bar{p}_1$ = 69/125 = .5520
Ford Fusion        $\bar{p}_2$ = 120/200 = .6000
Honda Accord       $\bar{p}_3$ = 123/175 = .7029


Since the chi-square test indicated that not all population proportions are equal, it is 
reasonable for us to proceed by attempting to determine where differences among the 
population proportions exist. For this we will rely on a multiple comparison procedure 
that can be used to conduct statistical tests between all pairs of population proportions. 
In the following, we discuss a multiple comparison procedure known as the Marascuilo 
procedure. This is a relatively straightforward procedure for making pairwise comparisons 
of all pairs of population proportions. We will demonstrate the computations required by 
this multiple comparison test procedure for the automobile customer loyalty study. 

We begin by computing the absolute value of the pairwise difference between sample 
proportions for each pair of populations in the study. In the three-population automobile 
brand loyalty study we compare populations 1 and 2, populations 1 and 3, and then populations
2 and 3 using the sample proportions as follows:


Chevrolet Impala and Ford Fusion
$|\bar{p}_1 - \bar{p}_2|$ = |.5520 - .6000| = .0480

Chevrolet Impala and Honda Accord
$|\bar{p}_1 - \bar{p}_3|$ = |.5520 - .7029| = .1509

Ford Fusion and Honda Accord
$|\bar{p}_2 - \bar{p}_3|$ = |.6000 - .7029| = .1029


In a second step, we select a level of significance and compute the corresponding critical 
value for each pairwise comparison using the following expression. 


CRITICAL VALUES FOR THE MARASCUILO PAIRWISE COMPARISON PROCEDURE
FOR k POPULATION PROPORTIONS

For each pairwise comparison compute a critical value as follows:

$$CV_{ij} = \sqrt{\chi^2_\alpha}\,\sqrt{\frac{\bar{p}_i(1 - \bar{p}_i)}{n_i} + \frac{\bar{p}_j(1 - \bar{p}_j)}{n_j}} \qquad (12.6)$$

where

$\chi^2_\alpha$ = chi-square with a level of significance α and k - 1 degrees of freedom
$\bar{p}_i$ and $\bar{p}_j$ = sample proportions for populations i and j
$n_i$ and $n_j$ = sample sizes for populations i and j


Using the chi-square distribution in Table 3 of Appendix B with k - 1 = 3 - 1 = 2
degrees of freedom, and a .05 level of significance, we have $\chi^2_{.05}$ = 5.991. Now using the
sample proportions $\bar{p}_1$ = .5520, $\bar{p}_2$ = .6000, and $\bar{p}_3$ = .7029, the critical values for the
three pairwise comparison tests are as follows:


Chevrolet Impala and Ford Fusion

$$CV_{12} = \sqrt{5.991}\,\sqrt{\frac{.5520(1 - .5520)}{125} + \frac{.6000(1 - .6000)}{200}} = .1380$$

Chevrolet Impala and Honda Accord

$$CV_{13} = \sqrt{5.991}\,\sqrt{\frac{.5520(1 - .5520)}{125} + \frac{.7029(1 - .7029)}{175}} = .1379$$

Ford Fusion and Honda Accord

$$CV_{23} = \sqrt{5.991}\,\sqrt{\frac{.6000(1 - .6000)}{200} + \frac{.7029(1 - .7029)}{175}} = .1198$$


TABLE 12.8 Pairwise Comparison Tests for the Automobile Brand Loyalty Study

                                                                              Significant If
Pairwise Comparison                  $|\bar{p}_i - \bar{p}_j|$   $CV_{ij}$   $|\bar{p}_i - \bar{p}_j| > CV_{ij}$
Chevrolet Impala vs. Ford Fusion     .0480                       .1380       Not significant
Chevrolet Impala vs. Honda Accord    .1509                       .1379       Significant
Ford Fusion vs. Honda Accord         .1029                       .1198       Not significant


If the absolute value of any pairwise sample proportion difference $|\bar{p}_i - \bar{p}_j|$ exceeds
its corresponding critical value, $CV_{ij}$, the pairwise difference is significant at the .05 level
of significance and we can conclude that the two corresponding population proportions are
different. The final step of the pairwise comparison procedure is summarized in Table 12.8.

Note: An Excel workbook that eases the computational burden of making these pairwise
comparisons is on the website.

The conclusion from the pairwise comparison procedure is that the only significant
difference in customer loyalty occurs between the Chevrolet Impala and the Honda Accord.
Our sample results indicate that the Honda Accord had a greater population proportion of
owners who say they are likely to repurchase the Honda Accord. Thus, we can conclude
that the Honda Accord ($\bar{p}_3$ = .7029) has a greater customer loyalty than the Chevrolet
Impala ($\bar{p}_1$ = .5520).

The results of the study are inconclusive as to the comparative loyalty of the Ford Fusion.
While the Ford Fusion did not show significantly different results when compared to the
Chevrolet Impala or Honda Accord, a larger sample may have revealed a significant difference
between Ford Fusion and the other two automobiles in terms of customer loyalty. It is
not uncommon for a multiple comparison procedure to show significance for some pairwise
comparisons and yet not show significance for other pairwise comparisons in the study.
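The Marascuilo pairwise comparisons for the automobile brand loyalty study can be sketched in Python as follows. This is an illustrative sketch, not the workbook mentioned in this section: the function and variable names are our own, and the chi-square value 5.991 (α = .05, 2 degrees of freedom) is taken from the chi-square table rather than computed.

```python
import math
from itertools import combinations

# Tabled chi-square value for alpha = .05 with k - 1 = 2 degrees of freedom.
chi_square_critical = 5.991

# (number likely to repurchase, sample size) for each population, from Table 12.5.
samples = {
    "Chevrolet Impala": (69, 125),
    "Ford Fusion": (120, 200),
    "Honda Accord": (123, 175),
}


def marascuilo_critical_value(p_i, n_i, p_j, n_j, chi_sq):
    """Equation (12.6): critical value for one pairwise comparison."""
    return math.sqrt(chi_sq) * math.sqrt(
        p_i * (1 - p_i) / n_i + p_j * (1 - p_j) / n_j
    )


results = []
for (name_i, (yes_i, n_i)), (name_j, (yes_j, n_j)) in combinations(samples.items(), 2):
    p_i, p_j = yes_i / n_i, yes_j / n_j
    difference = abs(p_i - p_j)
    cv = marascuilo_critical_value(p_i, n_i, p_j, n_j, chi_square_critical)
    # A pair is significant when the absolute difference exceeds its critical value.
    results.append((name_i, name_j, difference, cv, difference > cv))

for name_i, name_j, difference, cv, significant in results:
    verdict = "Significant" if significant else "Not significant"
    print(f"{name_i} vs. {name_j}: {difference:.4f} {cv:.4f} {verdict}")

significant_pairs = [(a, b) for a, b, _, _, sig in results if sig]
```

Because the sketch uses unrounded sample proportions, its critical values can differ from those in Table 12.8 in the fourth decimal place; the significance conclusions are the same.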


Using Excel to Conduct a Test of Multiple Proportions 


The Excel procedure used to test for the equality of three or more population proportions 
is essentially the same as the Excel procedure used to conduct a test of independence. The 
CHISQ.TEST function is used with the table of observed frequencies as one input and 
the table of expected frequencies as the other input. The function output is the p-value for 
the test. We illustrate using the automobile brand loyalty study. Refer to Figure 12.5 as we 
describe the tasks involved. The formula worksheet is in the background; the value work- 
sheet is in the foreground. 


(DATA file: PairwiseComparisons)


Enter/Access Data: Open the file AutoLoyalty. The data are in cells B2:C501 and labels
are in column A and cells B1:C1.


Apply Tools: The observed frequencies have been computed in cells F5:H6 using Excel’s
PivotTable tool (see Section 2.4 for details regarding how to use this tool).


Enter Functions and Formulas: The Excel formulas in cells F12:H13 were used to
compute the expected frequencies for each category. Once the observed and expected
frequencies have been computed, Excel’s CHISQ.TEST function has been used in cell H15
to compute the p-value for the test. The value worksheet shows that the resulting p-value
is .0193. With α = .05, we reject the null hypothesis that the three population proportions
are equal.


FIGURE 12.5 Excel Worksheet for the Automobile Loyalty Study
Note: Rows 17-499 are hidden.


NOTES + COMMENTS 


1. In Chapter 10, we used the standard normal distribution and the z test statistic to
conduct hypothesis tests about the proportions of two populations. However, the
chi-square test introduced in this section can also be used to conduct the hypothesis
test that the proportions of two populations are equal. The results will be the same under
both test procedures and the value of the test statistic $\chi^2$ will be equal to the square
of the value of the test statistic z. An advantage of the methodology in Chapter 10 is
that it can be used for either a one-tailed or a two-tailed hypothesis about the proportions
of two populations, whereas the chi-square test in this section can be used only for
two-tailed tests.

2. Each of the k populations in this section had two response outcomes, Yes or No. In
effect, each population had a binomial distribution with parameter p, the population
proportion of Yes responses. An extension of the chi-square procedure in this section
applies when each of the k populations has three or more possible responses. In this case,
each population is said to have a multinomial distribution. The chi-square calculations
for the expected frequencies, $e_{ij}$, and the test statistic, $\chi^2$, are the same as shown in
expressions (12.4) and (12.5). The only difference is that the null hypothesis assumes that
the multinomial distribution for the response variable is the same for all populations.
With r responses for each of the k populations, the chi-square test statistic has
(r - 1)(k - 1) degrees of freedom.

3. The procedure for computing expected frequencies for the test of multiple proportions
and for the test of independence in Section 12.2 are the same and both tests employ the
same chi-square test statistic. But a key difference is that the test of independence is
based on one sample from a single population. The test of multiple proportions is based
on k independent samples from k populations. Thus, with the test of multiple proportions
we can control the sample size for each of the k population categories. With the test of
independence, we control only the overall sample size.


EXERCISES


Methods


17. Use the sample data below to test the hypotheses

$H_0$: $p_1 = p_2 = p_3$
$H_a$: Not all population proportions are equal

where $p_i$ is the population proportion of Yes responses for population i. Using a .05 level
of significance, what is the p-value and what is your conclusion?


Populations
Response 1 2 3
Yes 150 150 96
No 100 150 104

18. Reconsider the observed frequencies in exercise 17. 
a. Compute the sample proportion for each population. 
b. Use the multiple comparison procedure to determine which population proportions 
differ significantly. Use a .05 level of significance. 


Applications 
19. Late Flight Comparison Across Airlines. The following sample data represent the 
number of late and on-time flights for Delta, United, and US Airways. 


Airline 
Flight Delta United US Airways 
Late 39 51 56
On Time 261 249 344 


a. Formulate the hypotheses for a test that will determine whether the population 
proportion of late flights is the same for all three airlines. 

b. Conduct the hypothesis test with a .05 level of significance. What is the p-value and 
what is your conclusion? 

c. Compute the sample proportion of late flights for each airline. What is the overall 
proportion of late flights for the three airlines? 

20. Electronic Component Supplier Quality Comparison. Benson Manufacturing is 
considering ordering electronic components from three different suppliers. The sup- 
pliers may differ in terms of quality in that the proportion or percentage of defective 
components may differ among the suppliers. To evaluate the proportion of defective 
components for the suppliers, Benson has requested a sample shipment of 500 com- 
ponents from each supplier. The number of defective components and the number of 
good components found in each shipment are as follows. 


                  Supplier
Component    A      B      C
Defective    15     20     40
Good         485    480    460


a. Formulate the hypotheses that can be used to test for equal proportions of defective 
components provided by the three suppliers. 

b. Using a .05 level of significance, conduct the hypothesis test. What is the p-value 
and what is your conclusion? 

c. Conduct a multiple comparison test to determine whether there is an overall best 
supplier or if one supplier can be eliminated because of poor quality. 

21. Agricultural Contaminant Effect on Fish. Kate Sanders, a researcher in the department
of biology at IPFW University, studied the effect of agricultural contaminants on
the stream fish population in northeastern Indiana. Specially designed traps collected
samples of fish at each of four stream locations. A research question was, Did the
differences in agricultural contaminants found at the four locations alter the proportion
of the fish population by gender? Observed frequencies were as follows.


            Stream Locations
Gender    A     B     C     D
Male      49    44    49    39
Female    41    46    36    44


a. Focusing on the proportion of male fish at each location, test the hypothesis that the 
population proportions are equal for all four locations. Use a .05 level of signifi- 
cance. What is the p-value and what is your conclusion? 

b. Does it appear that differences in agricultural contaminants found at the four loca- 
tions altered the fish population by gender? 


Exercise 22 shows a chi-square test can be used when the hypothesis is about the equality
of two population proportions.

22. Error Rates in Tax Preparation. A tax preparation firm is interested in comparing
the quality of work at two of its regional offices. The observed frequencies showing
the number of sampled returns with errors and the number of sampled returns that
were correct are as follows.


             Regional Office
Return     Office 1    Office 2
Error      35          27
Correct    215         273


a. What are the sample proportions of returns with errors at the two offices? 

b. Use the chi-square test procedure to see if there is a significant difference between 
the population proportion of error rates for the two offices. Test the null hypothesis 
$H_0$: $p_1 = p_2$ with a .10 level of significance. What is the p-value and what is your
conclusion? (Note: We generally use the chi-square test of equal proportions when 
there are three or more populations, but this example shows that the same chi- 
square test can be used for testing equal proportions with two populations.) 

c. In Section 10.2, a z test was used to conduct the above test. Either a $\chi^2$ test
statistic or a z test statistic may be used to test the hypothesis. However, when we 
want to make inferences about the proportions for two populations, we generally 
prefer the z test statistic procedure. Refer to the Notes and Comments at the end 
of this section and comment on why the z test statistic provides the user with more 
options for inferences about the proportions of two populations. 

23. Use of Social Media. Social media is popular around the world. Statista provides
an estimate of the number of social media users in various countries in 2017 as well as
the projections for 2022. Assume that the results for surveys in the United Kingdom,
China, Russia, and the United States are as follows.

DATA file: SocialMedia

                                    Country
Use Social Media    United Kingdom    China    Russia    United States
Yes                 480               215      343       640
No                  320               285      357       360

a. Conduct a hypothesis test to determine whether the proportion of adults using social
media is equal for all four countries. What is the p-value? Using a .05 level of
significance, what is your conclusion?

b. What are the sample proportions for each of the four countries? Which country has
the largest proportion of adults using social media?

c. Using a .05 level of significance, conduct multiple pairwise comparison tests
among the four countries. What is your conclusion?

Exercise 24 shows a chi-square test can also be used for multiple population tests when
the categorical response variable has three or more outcomes.

24. Supplier Quality: Three Inspection Outcomes. The Ertl Company is well-known for
its high-quality die-cast metal alloy toy replicas of tractors and other farm equipment.
As part of a periodic procurement evaluation, Ertl is considering purchasing parts for a
toy tractor line from three different suppliers. The parts received from the suppliers are
classified as having a minor defect, having a major defect, or being good. Test results
from samples of parts received from each of the three suppliers are shown below. Note
that any test with these data is no longer a test of proportions for the three supplier
populations because the categorical response variable has three outcomes: minor
defect, major defect, and good.


                      Supplier
Part Tested     A      B      C
Minor Defect    15     13     21
Major Defect    5      11     5
Good            130    126    124


Using the preceding data, conduct a hypothesis test to determine whether the 
distribution of defects is the same for the three suppliers. Use the chi-square test 
calculations as presented in this section with the exception that a table with r rows 
and c columns results in a chi-square test statistic with (r - 1)(c - 1) degrees of
freedom. Using a .05 level of significance, what is the p-value and what is your 
conclusion? 


SUMMARY 


In this chapter we have introduced hypothesis tests for the following applications. 


1. Goodness of Fit Test. A test designed to determine whether the observed frequency
distribution for a sample differs significantly from what we would expect given the
hypothesized probabilities for each category.

2. Test of Independence. A test designed to determine whether the table of observed
frequencies for a sample involving two categorical variables from the same population
differs significantly from what we would expect if the two variables were independent
as specified in the null hypothesis.

3. Test of Equality for Three or More Population Proportions. A test designed to determine
whether three or more sample proportions differ significantly from what we would
expect if the corresponding population proportions were equal as stated in the null
hypothesis.

The first two tests are based on a single sample from a single population. The third test
is based on independent samples from three or more different populations. All tests use a
chi-square ($\chi^2$) test statistic that is based on the differences between observed frequencies
from a sample and the frequencies that we would expect if the null hypothesis were true.
Large differences between observed and expected frequencies provide a large value for the
chi-square test statistic and indicate that the null hypothesis should be rejected; thus, these
chi-square tests are upper tail tests.


GLOSSARY

Goodness of fit test A chi-square test that can be used to test that a probability distribu- 
tion has a specific historical or theoretical probability distribution. This test was demon- 
strated for a multinomial probability distribution. 

Marascuilo procedure A multiple comparison procedure that can be used to test for a 
significant difference between pairs of population proportions. This test can be helpful in 
identifying differences between pairs of population proportions whenever the hypothesis of 
equal population proportions has been rejected. 

Multinomial probability distribution A probability distribution where each outcome be- 
longs to one of three or more categories. The multinomial probability distribution extends 
the binomial probability distribution from two to three or more outcomes per trial. 

Test of independence A chi-square test that can be used to test for the independence of 
two random variables. If the hypothesis of independence is rejected, it can be concluded 
that the random variables are associated or dependent. 
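
As an illustration of the test of independence, the following minimal Python sketch computes the expected frequencies, (row total)(column total)/(sample size), and the chi-square test statistic for a small two-way table. The observed counts are hypothetical and do not come from the text, and the function name chi_square_statistic is an illustrative choice. The critical value 5.991 is the upper tail chi-square value for a .05 level of significance with 2 degrees of freedom.

```python
def chi_square_statistic(observed):
    """Return (chi2, df, expected) for a table of observed frequencies.

    observed is a list of rows; each row is a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected frequency for cell (i, j) if the null hypothesis is true.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell.
    chi2 = sum(
        (f - e) ** 2 / e
        for f_row, e_row in zip(observed, expected)
        for f, e in zip(f_row, e_row)
    )

    # (r - 1)(c - 1) degrees of freedom for r rows and c columns.
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical observed frequencies: 2 response rows by 3 group columns.
observed = [
    [60, 50, 40],
    [40, 50, 60],
]

chi2, df, expected = chi_square_statistic(observed)

# Upper tail critical value for alpha = .05 with 2 degrees of freedom.
CRITICAL_05_DF2 = 5.991
reject = chi2 >= CRITICAL_05_DF2

print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {reject}")
```

Because these are upper tail tests, the null hypothesis is rejected only when the statistic is large. The same calculation applies to the test for equality of three or more population proportions, where the columns are the independent samples.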


KEY FORMULAS 


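Test Statistic for the Goodness of Fit Test

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \tag{12.1}$$

Expected Frequencies: Test of Independence

$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample Size}} \tag{12.2}$$

Chi-Square Test Statistic for Test of Independence and Test for Equality of Three or
More Population Proportions

$$\chi^2 = \sum_{i} \sum_{j} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} \tag{12.3 and 12.5}$$

Expected Frequencies: Test for Equality of Three or More Population Proportions

$$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sum of Sample Sizes}} \tag{12.4}$$

Critical Values for the Marascuilo Pairwise Comparison Procedure

$$CV_{ij} = \sqrt{\chi^2_{\alpha}} \sqrt{\frac{\bar{p}_i(1 - \bar{p}_i)}{n_i} + \frac{\bar{p}_j(1 - \bar{p}_j)}{n_j}} \tag{12.6}$$
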
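
The Marascuilo critical values can be sketched in a few lines of Python. This is a minimal illustration, assuming hypothetical counts (60, 50, and 40 Yes responses in three samples of 100) that do not come from the text. The function name marascuilo is an illustrative choice, and 5.991 is the upper tail chi-square value for a .05 level of significance with k - 1 = 2 degrees of freedom.

```python
from itertools import combinations
from math import sqrt


def marascuilo(yes_counts, sample_sizes, chi2_critical):
    """Compare every pair of sample proportions with its critical value.

    Returns a list of tuples (i, j, abs_difference, critical_value, significant)
    using 1-based population labels.
    """
    p = [y / n for y, n in zip(yes_counts, sample_sizes)]
    results = []
    for i, j in combinations(range(len(p)), 2):
        diff = abs(p[i] - p[j])
        cv = sqrt(chi2_critical) * sqrt(
            p[i] * (1 - p[i]) / sample_sizes[i]
            + p[j] * (1 - p[j]) / sample_sizes[j]
        )
        # The pair differs significantly when the difference exceeds CV.
        results.append((i + 1, j + 1, diff, cv, diff > cv))
    return results


# Hypothetical data: Yes counts and sample sizes for three populations.
yes_counts = [60, 50, 40]
sample_sizes = [100, 100, 100]

results = marascuilo(yes_counts, sample_sizes, chi2_critical=5.991)
significant = [(i, j) for i, j, diff, cv, sig in results if sig]

for i, j, diff, cv, sig in results:
    print(f"{i} vs. {j}: |difference| = {diff:.3f}, CV = {cv:.4f}, significant: {sig}")
```

With these hypothetical counts only populations 1 and 3 differ significantly. The procedure is meant to be run after the overall hypothesis of equal population proportions has been rejected.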
SUPPLEMENTARY EXERCISES 
25. Better Business Bureau. Historically, the industries with the most complaints to the
Better Business Bureau have been banks, cable and satellite television companies,
collection agencies, cellular phone providers, and new car dealerships. The results for
a sample of 200 complaints are contained in the file BBB.

a. Construct a frequency distribution for the number of complaints by industry.

b. Using $\alpha = .01$, conduct a hypothesis test to determine whether the probability of a
complaint is the same for the five industries. What is your conclusion?

c. Drop the industry with the most complaints. Using $\alpha = .05$, conduct a hypothesis
test to determine whether the probability of a complaint is the same for the remaining
four industries.

26. Restaurant Entree Preferences. Bistro 65 is a chain of Italian restaurants with locations
in Ohio and Kentucky. The Bistro 65 menu has four categories of entrées: Pasta,
Steak & Chops, Seafood, and Other (e.g., pizza, sandwiches). Historical data for the 
chain show that the probability a customer will order an entrée from one of the four 
categories is .4 for Pasta, .1 for Steak & Chops, .2 for Seafood, and .3 for Other. A 
new Bistro 65 restaurant has just opened in Dayton, Ohio, and the following purchase 
frequencies have been observed for the first 200 customers. 


Category         Frequency
Pasta            70
Steak & Chops    30
Seafood          50
Other            50
Total            200


a. Conduct a hypothesis test to determine whether the order pattern for the new restau- 
rant in Dayton is the same as the historical pattern for the established Bistro 65 
restaurants. Use $\alpha = .05$.

b. If the difference in part (a) is significant, prepare a bar chart to show where the 
differences occur. Comment on any differences observed. 

27. Best-Selling Small Cars in America. Based on 2017 sales, the six top-selling 
compact cars are the Honda Civic, Toyota Corolla, Nissan Sentra, Hyundai Elantra, 
Chevrolet Cruze, and Ford Focus (New York Daily News). The 2017 market shares are 
Honda Civic 20%, Toyota Corolla 17%, Nissan Sentra 12%, Hyundai Elantra 10%, 
Chevrolet Cruze 10%, and Ford Focus 8%, with other small car models comprising the 
remaining 23%. A sample of 400 compact car sales in Chicago showed the following 
number of vehicles sold. 


Honda Civic 98 
Toyota Corolla 72 
Nissan Sentra 54 
Hyundai Elantra 44 
Chevrolet Cruze 42 
Ford Focus 25 
Others 65 


Use a goodness of fit test to determine whether the sample data indicate that the 
market shares for cars in Chicago are different than the market shares suggested by 
nationwide 2017 sales. Using a .05 level of significance, what is the p-value and what 
is your conclusion? If the Chicago market appears to differ significantly from nation- 
wide sales, which categories contribute most to this difference? 

28. Pace of Life Preference by Gender. A Pew Research Center survey asked respon- 
dents if they would rather live in a place with a slower pace of life or a place with a 
faster pace of life. The survey also asked the respondent’s gender. Consider the follow- 
ing sample data. 


                               Gender
Preferred Pace of Life    Male    Female
Slower                    230     218
No Preference             20      24
Faster                    90      48

a. Is the preferred pace of life independent of gender? Using a .05 level of significance,
what is the p-value and what is your conclusion?
b. Discuss any differences between the preferences of men and women. 

29. Church Attendance by Age Group. Barna Research Group conducted a survey about
church attendance. The survey respondents were asked about their church attendance 
and asked to indicate their age. Use the sample data to determine whether church 
attendance is independent of age. Using a .05 level of significance, what is the 
p-value and what is your conclusion? What conclusion can you draw about church 
attendance as individuals grow older? 


                                       Age
Church Attendance    20 to 29    30 to 39    40 to 49    50 to 59
Yes                  31          63          94          72
No                   69          87          106         78

30. Ambulance Calls by Day of Week. An ambulance service responds to emergency
calls for two counties in Virginia. One county is an urban county and the other is a
rural county. A sample of 471 ambulance calls over the past two years showed the
county and the day of the week for each emergency call. Data are as follows.

DATA file: Ambulance

                         Day of Week
County    Sun    Mon    Tue    Wed    Thu    Fri    Sat
Urban     61     48     50     55     63     73     43
Rural     7      9      16     13     9      14     10


Test for independence of the county and the day of the week. Using a .05 level of 
significance, what is the p-value and what is your conclusion? 

31. Where Millionaires Live in America. In a 2018 study, Phoenix Marketing Inter- 
national identified Bridgeport, Connecticut; San Jose, California; Washington, DC; 
and Lexington Park, Maryland, as the four U.S. cities with the highest percentage of 
millionaires (Kiplinger website). Consider a sample of data that show the following 
number of millionaires for samples of individuals from each of the four cities. 


                                              City
Millionaire    Bridgeport, CT    San Jose, CA    Washington, D.C.    Lexington Park, MD
Yes            44                35              35                  34
No             356               350             364                 366


a. What is the estimate of the percentage of millionaires in each of these cities? 
b. Using a .05 level of significance, test for the equality of the population proportion of 
millionaires for these four cities. What is the p-value and what is your conclusion? 

32. Quality Comparison Across Production Shifts. Arconic Inc is a producer of aluminum
components for the avionics and automotive industries. At its Davenport Works
plant, an engineer has conducted a quality control test in which aluminum coils produced
in each of the three shifts were inspected. The study was designed to determine
whether the population proportion of good parts was the same for all three shifts.
Sample data follow.


               Production Shift
Quality      First    Second    Third
Good         285      368       176
Defective    15       32        24


a. Using a .05 level of significance, conduct a hypothesis test to determine whether 
the population proportion of good parts is the same for all three shifts. What is the 
p-value and what is your conclusion? 

b. If the conclusion is that the population proportions are not all equal, use a multiple 
comparison procedure to determine how the shifts differ in terms of quality. What 
shift or shifts need to improve the quality of parts produced? 

33. Ratings of Most Visited Art Museums. As listed by The Art Newspaper Visitor 
Figures Survey, the five most visited art museums in the world are the Louvre Mu- 
seum, the National Museum in China, the Metropolitan Museum of Art, the Vatican 
Museums, and the British Museum (The Art Newspaper, April 2018). Which of these 
five museums would visitors most frequently rate as spectacular? Samples of recent 
visitors of each of these museums were taken, and the results of these samples follow. 


                   Louvre    National Museum    Metropolitan     Vatican    British
                   Museum    in China           Museum of Art    Museums    Museum
Spectacular        113       88                 94               98         96
Not Spectacular    37        44                 46               72         64


a. Use the sample data to calculate the point estimate of the population proportion of 
visitors who rated each of these museums as spectacular. 

b. Conduct a hypothesis test to determine if the population proportion of visitors who 
rated the museum as spectacular is equal for these five museums. Using a .05 level 
of significance, what is the p-value and what is your conclusion? 


CASE PROBLEM 1: A BIPARTISAN AGENDA 
FOR CHANGE 


In a study conducted by Zogby International for the Democrat and Chronicle, more 
than 700 New Yorkers were polled to determine whether the New York state government 
works. Respondents surveyed were asked questions involving pay cuts for state legislators, 
restrictions on lobbyists, term limits for legislators, and whether state citizens should be 
able to put matters directly on the state ballot for a vote. The results regarding several pro- 
posed reforms had broad support, crossing all demographic and political lines. 

Suppose that a follow-up survey of 100 individuals who live in the western region of 
New York was conducted. The party affiliation (Democrat, Independent, Republican) of each 
individual surveyed was recorded, as well as their responses to the following three questions. 


1. Should legislative pay be cut for every day the state budget is late?
Yes No

2. Should there be more restrictions on lobbyists?
Yes No

3. Should there be term limits requiring that legislators serve a fixed number of years?
Yes No


The responses were coded using 1 for a Yes response and 2 for a No response. The com- 
plete data set is available in the file NYReform. 


Managerial Report 

1. Use descriptive statistics to summarize the data from this study. What are your prelimi- 
nary conclusions about the independence of the response (Yes or No) and party affilia- 
tion for each of the three questions in the survey? 

2. With regard to question 1, test for the independence of the response (Yes and No) and
party affiliation. Use $\alpha = .05$.

3. With regard to question 2, test for the independence of the response (Yes and No) and
party affiliation. Use $\alpha = .05$.

4. With regard to question 3, test for the independence of the response (Yes and No) and
party affiliation. Use $\alpha = .05$.

5. Does it appear that there is broad support for change across all political lines? Explain. 


CASE PROBLEM 2: FUENTES SALTY SNACKS, INC. 
Six months ago Fuentes Salty Snacks, Inc. added a new flavor to its line of potato chips. 
The new flavor, candied bacon, was introduced through a nationwide rollout supported by 
an extensive promotional campaign. Fuentes’ management is convinced that quick penetra- 
tion into grocery stores is a key to the successful introduction of a new salty snack product, 
and management now wants to determine whether availability of Fuentes’ Candied Bacon 
Potato Chips is consistent in grocery stores across regions of the United States. Fuentes’
Marketing department has selected random samples of 40 grocery stores in each of its eight
U.S. sales regions: 


• New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island,
and Vermont)

• Mid-Atlantic (New Jersey, New York, and Pennsylvania)

• Midwest (Illinois, Indiana, Michigan, Ohio, and Wisconsin)

• Great Plains (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, Oklahoma,
and South Dakota)

• South Atlantic (Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina,
Virginia, Washington, D.C., and West Virginia)

• Deep South (Alabama, Arkansas, Kentucky, Louisiana, Mississippi, Tennessee, and
Texas)

• Mountain (Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, and
Wyoming)

• Pacific (Alaska, California, Hawaii, Oregon, and Washington)


The stores in each sample were then contacted, and the manager of each store was asked 
whether the store currently carries Fuentes’ Candied Bacon Potato Chips. The complete 
data set is available in the file FuentesChips. 

Fuentes’ senior management now wants to use these data to assess whether penetration 
of Fuentes’ Candied Bacon Potato Chips in grocery stores is consistent across its eight 
U.S. sales regions. If penetration of Fuentes’ Candied Bacon Potato Chips in grocery stores 
differs across its eight U.S. sales regions, Fuentes’ management would also like to identify 
sales regions in which penetration of Fuentes’ Candied Bacon Potato Chips is lower or 
higher than expected.


Managerial Report
Prepare a managerial report that addresses the following issues. 


1. Use descriptive statistics to summarize the data from Fuentes’ study. Based on your de- 
scriptive statistics, what are your preliminary conclusions about penetration of Fuentes’ 
Candied Bacon Potato Chips in grocery stores across its eight U.S. sales regions? 

2. Use the data from Fuentes’ study to test the hypothesis that the proportion of grocery 
stores that currently carries Fuentes’ Candied Bacon Potato Chips is equal across its 
eight U.S. sales regions. Use $\alpha = .05$.

3. Do the results of your hypothesis test provide evidence that penetration of Fuentes’ 
Candied Bacon Potato Chips in grocery stores differs across its eight U.S. sales regions? 
In which sales region(s) is penetration of Fuentes’ Candied Bacon Potato Chips lower 
or higher than expected? Use the Marascuilo pairwise comparison procedure at $\alpha = .05$
to test for differences between regions. 


CASE PROBLEM 3: FRESNO BOARD GAMES 
Fresno Board Games manufactures and sells several different board games online and 
through department stores nationwide. Fresno’s most popular game, ¡Cabestrillo Cinco!, is
played with 5 six-sided dice. Fresno has purchased dice for this game from Box Cars, Ltd. 
for 25 years, but the company is now considering a move to Big Boss Gaming, Inc. (BBG), 
a new supplier that has offered to sell dice to Fresno at a substantially lower price. Fresno 
management is intrigued by the potential savings offered by BBG, but is also concerned 
about the quality of the dice produced by the new supplier. Fresno has a reputation for 
high integrity, and its management feels that it is imperative that the dice included with 
¡Cabestrillo Cinco! are fair.

To alleviate concerns about the quality of the dice it produces, BBG allows Fresno’s
Manager of Product Quality to randomly sample 5 dice from its most recent production
run. While being observed by several members of the BBG management team, Fresno’s
Manager of Product Quality rolls each of these 5 randomly selected dice 500 times and
records each outcome. The results for each of these 5 randomly selected dice are available
in the file BBG.

Fresno management now wants to use these data to assess whether any of these 5 six-sided
dice is not fair; i.e., does one outcome occur more frequently or less frequently than
the other outcomes?


Managerial Report 
Prepare a managerial report that addresses the following issues. 


1. Use descriptive statistics to summarize the data collected by Fresno’s Manager of 
Product Quality for each of the 5 randomly selected dice. Based on these descrip- 
tive statistics, what are your preliminary conclusions about the fairness of the 5 
selected dice? 

2. Use the data collected by Fresno’s Manager of Product Quality to test the hypothesis
that the first of the 5 randomly selected dice is fair, i.e., the distribution of outcomes
for the first of the 5 randomly selected dice is multinomial with
$p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6$. Repeat this process for each of the other 4
randomly selected dice. Use $\alpha = .01$. Do the results of your hypothesis tests provide
evidence that BBG is producing unfair dice?


STATISTICS IN PRACTICE


Burke, Inc.* 
CINCINNATI, OHIO 


Burke Inc. is one of the most experienced market research firms in the industry. Burke
writes more proposals, on more projects, every day than any other market research
company in the world. Supported by state-of-the-art technology, Burke offers a wide
variety of research capabilities, providing answers to nearly any marketing question.

[Photo: Burke uses taste tests to provide valuable statistical information on what
customers want from a product. Skydive Erick/Shutterstock.com]

In one study, a firm retained Burke to evaluate potential new versions of a children’s
dry cereal. To maintain confidentiality, we refer to the cereal manufacturer as the Anon
Company. The four key factors that Anon’s product developers thought would enhance the
taste of the cereal were the following:

1. Ratio of wheat to corn in the cereal flake
2. Type of sweetener: sugar, honey, or artificial
3. Presence or absence of flavor bits with a fruit taste
4. Short or long cooking time

Burke designed an experiment to determine what effects these four factors had on cereal
taste. For example, one test cereal was made with a specified ratio of wheat to corn,
sugar as the sweetener, flavor bits, and a short cooking time; another test cereal was
made with a different ratio of wheat to corn and the other three factors the same, and
so on. Groups of children then taste-tested the cereals and stated what they thought
about the taste of each.

Analysis of variance was the statistical method used to study the data obtained from the
taste tests. The results of the analysis showed the following:

• The flake composition and sweetener type were highly influential in taste evaluation.
• The flavor bits actually detracted from the taste of the cereal.
• The cooking time had no effect on the taste.

This information helped Anon identify the factors that would lead to the best-tasting
cereal.

The experimental design employed by Burke and the subsequent analysis of variance were
helpful in making a product design recommendation. In this chapter, we will see how such
procedures are carried out.


*The authors are indebted to Dr. Ronald Tatham of Burke Inc. 
for providing this Statistics in Practice. 


In Chapter 1 we stated that statistical studies can be classified as either experimental or 
observational. In an experimental statistical study, an experiment is conducted to generate 
the data. An experiment begins with identifying a variable of interest. Then one or more 
other variables, thought to be related, are identified and controlled, and data are collected 
about how those variables influence the variable of interest. 

In an observational study, data are usually obtained through sample surveys and not a 
controlled experiment. Good design principles are still employed, but the rigorous controls 
associated with an experimental statistical study are often not possible. 

An association between two variables is necessary but not sufficient for establishing a
causal relationship between the two variables.

Sir Ronald Aylmer Fisher (1890-1962) invented the branch of statistics known as
experimental design. In addition to being accomplished in statistics, he was a noted
scientist in the field of genetics.

Cause-and-effect relationships can be difficult to establish in observational studies;
such relationships are easier to establish in experimental studies.

To illustrate the differences between an observational study and an experimental study,
consider the product introduction options for a new high-intensity flashlight produced by
the High Lumens Flashlight Company (HLFC). The new model, referred to as the HLS, is
powered by a rechargeable lithium-ion battery. A large nationwide chain of sporting goods
stores has agreed to help HLFC determine the best promotional strategy for the new
flashlight. HLFC is considering three promotional strategies: giving each customer who purchases
an HLS a coupon for a 20% discount at checkout; giving each customer who purchases an
HLS a coupon for a free rechargeable battery; and no promotional offer. In order to assess the
effect of these three strategies on sales, HLFC randomly selected 60 sporting goods stores
to participate in the study. Two options for determining which promotional strategy to use in
each of these stores are being considered:


• Allow each store to select any one of the three promotional strategies and then record
sales for the thirty-day trial period.

• Randomly assign each of the three promotional strategies to 20 of the 60 stores and
then record sales for the thirty-day trial period.


HLFC’s first option is an observational study, because the 60 stores participating in the 
study are not each randomly assigned to one of the three promotional strategies. If HLFC 
employs this option, sales for the thirty-day trial period may be systematically biased by 
factors other than the promotional strategy that is used at each store. For example, some 
store managers may have a bias against using the 20% discount at checkout promotion 
because they have not had great success with such promotions in the past. Other store 
managers may have a bias against offering a coupon for a free rechargeable battery because 
of the paperwork involved in processing such offers. HLFC will not be able to determine 
whether differences in sales can be explained by the effect of the type of promotion used. 

HLFC’s second option is an experimental study, because each of the 60 stores participating
in the study is randomly assigned to one of the three promotional strategies.
If HLFC uses this option, the likelihood of systematic biases affecting sales is greatly
reduced. Therefore, if differences in sales are observed for the three strategies, HLFC will 
have a much stronger case for concluding that such differences can be explained by the 
effect of the type of promotion used. 
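
The random assignment in the second option can be sketched in Python as follows. This is a minimal illustration rather than HLFC's actual procedure: the store identifiers 1 through 60, the strategy labels, the function name assign_treatments, and the seed value are all assumptions made for the example.

```python
import random


def assign_treatments(units, treatments, per_treatment, seed=None):
    """Randomly assign an equal number of units to each treatment."""
    if len(units) != len(treatments) * per_treatment:
        raise ValueError("units must divide evenly among the treatments")

    rng = random.Random(seed)   # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)       # every ordering of the units is equally likely

    # Cut the shuffled list into consecutive groups, one per treatment.
    return {
        treatment: sorted(shuffled[k * per_treatment:(k + 1) * per_treatment])
        for k, treatment in enumerate(treatments)
    }


# Hypothetical store identifiers and strategy labels.
stores = list(range(1, 61))
strategies = ["20% discount coupon", "free battery coupon", "no promotion"]

assignment = assign_treatments(stores, strategies, per_treatment=20, seed=13)

for strategy, assigned in assignment.items():
    print(f"{strategy}: {len(assigned)} stores, first five: {assigned[:5]}")
```

Because chance alone decides which stores receive each strategy, a store manager's preferences cannot influence the assignment, which is what distinguishes this option from the first one.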

It may be difficult and/or expensive to design and conduct an experimental study for 
some research questions. And, in some instances, it may be impractical or even impos- 
sible. For instance, in a study of the relationship between smoking and lung cancer, the 
researcher cannot randomly assign a smoking habit to each subject. The researcher is 
restricted to observing the outcomes for people who already smoke and the outcomes for 
people who do not already smoke and observing how frequently members of these two 
groups develop lung cancer. The design of this study prohibits the researcher from drawing 
a conclusion about whether smoking causes lung cancer; the researcher can only assess 
whether there is an association between smoking and lung cancer. 

In this chapter we introduce three types of experimental designs: a completely ran- 
domized design, a randomized block design, and a factorial experiment. For each design 
we show how a statistical procedure called analysis of variance (ANOVA) can be used to 
analyze the data available. ANOVA can also be used to analyze the data obtained through 
an observational study. For instance, we will see that the ANOVA procedure used for a 
completely randomized experimental design also works for testing the equality of three 
or more population means when data are obtained through an observational study. In the 
following chapters we will see that ANOVA plays a key role in analyzing the results of 
regression studies involving both experimental and observational data. 

In the first section, we introduce the basic principles of an experimental study and show 
how they are employed in a completely randomized design. In the second section, we then 
show how ANOVA can be used to analyze the data from a completely randomized experi- 
mental design. In later sections we discuss multiple comparison procedures and two other 
widely used experimental designs, the randomized block design and the factorial experiment. 


13.1 An Introduction to Experimental Design 
and Analysis of Variance 


As an example of an experimental statistical study, let us consider the problem facing Che- 
mitech, Inc. Chemitech developed a new filtration system for municipal water supplies. The 
components for the new filtration system will be purchased from several suppliers, and Che- 
mitech will assemble the components at its plant in Columbia, South Carolina. 
The industrial engineering group is responsible for determining the best assembly method for
the new filtration system. After considering a variety of possible approaches, the group nar- 
rows the alternatives to three: method A, method B, and method C. These methods differ in 
the sequence of steps used to assemble the system. Managers at Chemitech want to determine 
which assembly method can produce the greatest number of filtration systems per week. 

In the Chemitech experiment, assembly method is the independent variable or factor. 
Because three assembly methods correspond to this factor, we say that three treatments are 
associated with this experiment; each treatment corresponds to one of the three assembly 
methods. The Chemitech problem is an example of a single-factor experiment; it involves 
one categorical factor (method of assembly). More complex experiments may consist of 
multiple factors; some factors may be categorical and others may be quantitative. 

The three assembly methods or treatments define the three populations of interest for 
the Chemitech experiment. One population is all Chemitech employees who use assembly 
method A, another is those who use method B, and the third is those who use method C. 
Note that for each population the dependent or response variable is the number of filtra- 
tion systems assembled per week, and the primary statistical objective of the experiment is 
to determine whether the mean number of units produced per week is the same for all three 
populations (methods). 


Suppose a random sample of three employees is selected from all assembly workers at
the Chemitech production facility. In experimental design terminology, the three randomly
selected workers are the experimental units. The experimental design that we will use for
the Chemitech problem is called a completely randomized design. This type of design
requires that each of the three assembly methods or treatments be assigned randomly to
one of the experimental units or workers. For example, method A might be randomly
assigned to the second worker, method B to the first worker, and method C to the third
worker. The concept of randomization, as illustrated in this example, is an important
principle of all experimental designs.

Randomization is the process of assigning the treatments to the experimental units at
random. Prior to the work of Sir R. A. Fisher, treatments were assigned on a systematic
or subjective basis.


Note that this experiment would result in only one measurement or number of units 
assembled for each treatment. To obtain additional data for each assembly method, we 
must repeat or replicate the basic experimental process. Suppose, for example, that instead 
of selecting just three workers at random we selected 15 workers and then randomly as- 
signed each of the three treatments to 5 of the workers. Because each method of assembly 
is assigned to 5 workers, we say that five replicates have been obtained. The process of 
replication is another important principle of experimental design. Figure 13.1 shows the 
completely randomized design for the Chemitech experiment. 
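
The randomization and replication steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the worker IDs 1 through 15, the function name, and the fixed seed are hypothetical and are not part of the Chemitech study.

```python
import random


def completely_randomized_assignment(units, treatments, replicates, seed=None):
    """Randomly assign each treatment to `replicates` experimental units."""
    if len(units) != len(treatments) * replicates:
        raise ValueError("need exactly len(treatments) * replicates units")
    rng = random.Random(seed)   # the seed only makes the illustration repeatable
    shuffled = list(units)
    rng.shuffle(shuffled)       # chance alone decides the order of the units
    assignment = {}
    for index, treatment in enumerate(treatments):
        start = index * replicates
        # Each consecutive block of the shuffled list receives one treatment.
        assignment[treatment] = sorted(shuffled[start:start + replicates])
    return assignment


workers = list(range(1, 16))    # hypothetical worker IDs 1 through 15
plan = completely_randomized_assignment(workers, ["A", "B", "C"], replicates=5, seed=13)
for method, assigned in plan.items():
    print(f"Method {method}: workers {assigned}")
```

Every worker receives exactly one method and every method receives five workers, so the sketch produces five replicates per treatment.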


Data Collection 


Once we are satisfied with the experimental design, we proceed by collecting and 
analyzing the data. In the Chemitech case, the employees would be instructed in how 
to perform the assembly method assigned to them and then would begin assembling the 
new filtration systems using that method. After this assignment and training, the num- 
ber of units assembled by each employee during one week is as shown in Table 13.1. 
The sample means, sample variances, and sample standard deviations for each as- 
sembly method are also provided. Thus, the sample mean number of units produced 
using method A is 62; the sample mean using method B is 66; and the sample mean 
using method C is 52. From these data, method B appears to result in higher production 
rates than either of the other methods. 
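
For readers who want to check the summary statistics outside Excel, the following sketch recomputes them from the 15 observations in Table 13.1 using Python's standard library. The variable names are illustrative assumptions.

```python
from statistics import mean, stdev, variance

# Units assembled per week, copied from Table 13.1.
units_by_method = {
    "A": [58, 64, 55, 66, 67],
    "B": [58, 69, 71, 64, 68],
    "C": [48, 57, 59, 47, 49],
}

summary = {}
for method, units in units_by_method.items():
    # statistics.variance and statistics.stdev use the n - 1 divisor (sample versions).
    summary[method] = {
        "mean": mean(units),
        "variance": variance(units),
        "std_dev": stdev(units),
    }
    row = summary[method]
    print(f"Method {method}: mean={row['mean']}, variance={row['variance']}, "
          f"std dev={row['std_dev']:.3f}")
```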

The real issue is whether the three sample means observed are different enough for us to 
conclude that the means of the populations corresponding to the three methods of assembly 
are different. To write this question in statistical terms, we introduce the following notation. 


$\mu_1$ = mean number of units produced per week using method A
$\mu_2$ = mean number of units produced per week using method B
$\mu_3$ = mean number of units produced per week using method C

FIGURE 13.1 Completely Randomized Design for Evaluating the Chemitech
Assembly Method Experiment

[Flow diagram: Employees at the plant in Columbia, South Carolina; a random sample of
15 employees is selected for the experiment; each of the three assembly methods is
randomly assigned to 5 employees: Method A ($n_1 = 5$), Method B ($n_2 = 5$),
Method C ($n_3 = 5$).]

Although we will never know the actual values of $\mu_1$, $\mu_2$, and $\mu_3$, we want to use the
sample means to test the following hypotheses.


$H_0$: $\mu_1 = \mu_2 = \mu_3$
$H_a$: Not all population means are equal


As we will demonstrate shortly, analysis of variance (ANOVA) is the statistical procedure
used to determine whether the observed differences in the three sample means are large
enough to reject $H_0$.

If $H_0$ is rejected, we cannot conclude that all population means are different.
Rejecting $H_0$ means that at least two population means have different values.


TABLE 13.1 Number of Units Produced by 15 Workers (DATA file: Chemitech)


                                       Method
                                A         B         C
                               58        58        48
                               64        69        57
                               55        71        59
                               66        64        47
                               67        68        49
Sample mean                    62        66        52
Sample variance              27.5      26.5      31.0
Sample standard deviation   5.244     5.148     5.568


Assumptions for Analysis of Variance


Three assumptions are required to use analysis of variance.


1. For each population, the response variable is normally distributed. Implication:
   In the Chemitech experiment the number of units produced per week (response
   variable) must be normally distributed for each assembly method.

2. The variance of the response variable, denoted $\sigma^2$, is the same for all of the
   populations. Implication: In the Chemitech experiment, the variance of the number
   of units produced per week must be the same for each assembly method.

3. The observations must be independent. Implication: In the Chemitech experiment,
   the number of units produced per week for each employee must be independent
   of the number of units produced per week for any other employee.

If the sample sizes are equal, analysis of variance is not sensitive to departures from
the assumption of normally distributed populations.


Analysis of Variance: A Conceptual Overview 


If the means for the three populations are equal, we would expect the three sample means 
to be close together. In fact, the closer the three sample means are to one another, the 
weaker the evidence we have for the conclusion that the population means differ. Alterna- 
tively, the more the sample means differ, the stronger the evidence we have for the conclu- 
sion that the population means differ. In other words, if the variability among the sample 
means is “small,” it supports $H_0$; if the variability among the sample means is “large,” it
supports $H_a$.

If the null hypothesis, $H_0$: $\mu_1 = \mu_2 = \mu_3$, is true, we can use the variability among the
sample means to develop an estimate of $\sigma^2$. First, note that if the assumptions for analysis
of variance are satisfied and the null hypothesis is true, each sample will have come from
the same normal distribution with mean $\mu$ and variance $\sigma^2$. Recall from Chapter 7 that the
sampling distribution of the sample mean $\bar{x}$ for a simple random sample of size $n$ from a
normal population will be normally distributed with mean $\mu$ and variance $\sigma^2/n$. Figure 13.2
illustrates such a sampling distribution.

FIGURE 13.2 Sampling Distribution of $\bar{x}$ Given $H_0$ Is True

[Figure: a single sampling distribution of $\bar{x}$. Annotation: sample means are “close
together” because there is only one sampling distribution when $H_0$ is true.]

Thus, if the null hypothesis is true, we can think of each of the three sample means,
$\bar{x}_1 = 62$, $\bar{x}_2 = 66$, and $\bar{x}_3 = 52$ from Table 13.1, as values drawn at random from the sampling
distribution shown in Figure 13.2. In this case, the mean and variance of the three $\bar{x}$ values
can be used to estimate the mean and variance of the sampling distribution. When the sample
sizes are equal, as in the Chemitech experiment, the best estimate of the mean of the sampling
distribution of $\bar{x}$ is the mean or average of the sample means. In the Chemitech experiment, an
estimate of the mean of the sampling distribution of $\bar{x}$ is (62 + 66 + 52)/3 = 60. We refer to
this estimate as the overall sample mean. An estimate of the variance of the sampling
distribution of $\bar{x}$, $\sigma_{\bar{x}}^2$, is provided by the variance of the three sample means.

$$s_{\bar{x}}^2 = \frac{(62 - 60)^2 + (66 - 60)^2 + (52 - 60)^2}{3 - 1} = \frac{104}{2} = 52$$

Because $\sigma_{\bar{x}}^2 = \sigma^2/n$, solving for $\sigma^2$ gives

$$\sigma^2 = n\sigma_{\bar{x}}^2$$

Hence,

$$\text{Estimate of } \sigma^2 = n(\text{Estimate of } \sigma_{\bar{x}}^2) = n s_{\bar{x}}^2 = 5(52) = 260$$

The result, $n s_{\bar{x}}^2 = 260$, is referred to as the between-treatments estimate of $\sigma^2$.

The between-treatments estimate of $\sigma^2$ is based on the assumption that the null
hypothesis is true. In this case, each sample comes from the same population, and there is only
one sampling distribution of $\bar{x}$. To illustrate what happens when $H_0$ is false, suppose the
population means all differ. Note that because the three samples are from normal populations
with different means, they will result in three different sampling distributions.
Figure 13.3 shows that in this case, the sample means are not as close together as they were
when $H_0$ was true. Thus, $s_{\bar{x}}^2$ will be larger, causing the between-treatments estimate of $\sigma^2$
to be larger. In general, when the population means are not equal, the between-treatments
estimate will overestimate the population variance $\sigma^2$.

The variation within each of the samples also has an effect on the conclusion we reach in
analysis of variance. When a random sample is selected from each population, each of the
sample variances provides an unbiased estimate of $\sigma^2$. Hence, we can combine or pool the
individual estimates of $\sigma^2$ into one overall estimate. The estimate of $\sigma^2$ obtained in this way is
called the pooled or within-treatments estimate of $\sigma^2$. Because each sample variance provides
an estimate of $\sigma^2$ based only on the variation within each sample, the within-treatments
estimate of $\sigma^2$ is not affected by whether the population means are equal. When the sample sizes
are equal, the within-treatments estimate of $\sigma^2$ can be obtained by computing the average of
the individual sample variances. For the Chemitech experiment we obtain

$$\text{Within-treatments estimate of } \sigma^2 = \frac{27.5 + 26.5 + 31.0}{3} = \frac{85}{3} = 28.33$$


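FIGURE 13.3 Sampling Distributions of $\bar{x}$ Given $H_0$ Is False

[Figure: three sampling distributions centered at $\mu_3$, $\mu_1$, and $\mu_2$, with the sample means
$\bar{x}_3$, $\bar{x}_1$, and $\bar{x}_2$ marked. Annotation: sample means come from different sampling
distributions and are not as close together when $H_0$ is false.]
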
In the Chemitech experiment, the between-treatments estimate of $\sigma^2$ (260) is much
larger than the within-treatments estimate of $\sigma^2$ (28.33). In fact, the ratio of these two
estimates is 260/28.33 = 9.18. Recall, however, that the between-treatments approach
provides a good estimate of $\sigma^2$ only if the null hypothesis is true; if the null hypothesis is
false, the between-treatments approach overestimates $\sigma^2$. The within-treatments approach
provides a good estimate of $\sigma^2$ in either case. Thus, if the null hypothesis is true, the two
estimates will be similar and their ratio will be close to 1. If the null hypothesis is false, the
between-treatments estimate will be larger than the within-treatments estimate, and their
ratio will be large. In the next section we will show how large this ratio must be to reject $H_0$.

In summary, the logic behind ANOVA is based on the development of two independent
estimates of the common population variance $\sigma^2$. One estimate of $\sigma^2$ is based on the
variability among the sample means themselves, and the other estimate of $\sigma^2$ is based on the
variability of the data within each sample. By comparing these two estimates of $\sigma^2$, we will
be able to determine whether the population means are equal.
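
The two estimates can be reproduced with a short Python sketch. It assumes equal sample sizes, as in the Chemitech experiment, and works only from the sample means and sample variances reported in Table 13.1; the variable names are illustrative.

```python
from statistics import mean, variance

n = 5                                   # observations per assembly method
sample_means = [62, 66, 52]             # methods A, B, C (Table 13.1)
sample_variances = [27.5, 26.5, 31.0]   # methods A, B, C (Table 13.1)

# Between-treatments estimate: n times the sample variance of the sample means.
between_estimate = n * variance(sample_means)

# Within-treatments estimate: average of the sample variances (equal n only).
within_estimate = mean(sample_variances)

ratio = between_estimate / within_estimate
print(f"between = {between_estimate}, within = {within_estimate:.2f}, ratio = {ratio:.2f}")
```

A ratio far above 1, as obtained here, is the pattern expected when the population means are not all equal.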


NOTES + COMMENTS


1. Randomization in experimental design is the analog of probability sampling in an
   observational study.

2. In many medical experiments, potential bias is eliminated by using a double-blind
   experimental design. With this design, neither the physician applying the treatment nor
   the subject knows which treatment is being applied. Many other types of experiments
   could benefit from this type of design.

3. In this section we provided a conceptual overview of how analysis of variance can be
   used to test for the equality of k population means for a completely randomized
   experimental design. We will see that the same procedure can also be used to test for
   the equality of k population means for an observational or nonexperimental study.

4. In Sections 10.1 and 10.2 we presented statistical methods for testing the hypothesis
   that the means of two populations are equal. ANOVA can also be used to test the
   hypothesis that the means of two populations are equal. In practice, however, analysis
   of variance is usually not used except when dealing with three or more population means.


13.2 Analysis of Variance and the Completely 
Randomized Design 


In this section, we show how analysis of variance can be used to test for the equality of k 
population means for a completely randomized design. The general form of the hypotheses 
tested is 


$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$
$H_a$: Not all population means are equal


where

$\mu_j$ = mean of the jth population


We assume that a random sample of size $n_j$ has been selected from each of the k
populations or treatments. For the resulting sample data, let


$x_{ij}$ = value of observation i for treatment j
$n_j$ = number of observations for treatment j
$\bar{x}_j$ = sample mean for treatment j
$s_j^2$ = sample variance for treatment j
$s_j$ = sample standard deviation for treatment j





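The formulas for the sample mean and sample variance for treatment j are as follows.

$$\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{ij}}{n_j} \tag{13.1}$$

$$s_j^2 = \frac{\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{n_j - 1} \tag{13.2}$$

The overall sample mean, denoted $\bar{\bar{x}}$, is the sum of all the observations divided by the total
number of observations. That is,

$$\bar{\bar{x}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} x_{ij}}{n_T} \tag{13.3}$$

where

$$n_T = n_1 + n_2 + \cdots + n_k \tag{13.4}$$

If the size of each sample is n, $n_T = kn$; in this case equation (13.3) reduces to

$$\bar{\bar{x}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} x_{ij}}{kn} = \frac{\sum_{j=1}^{k} \left( \sum_{i=1}^{n} x_{ij} / n \right)}{k} = \frac{\sum_{j=1}^{k} \bar{x}_j}{k} \tag{13.5}$$
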
In other words, whenever the sample sizes are the same, the overall sample mean is just the 
average of the k sample means. 

Because each sample in the Chemitech experiment consists of n = 5 observations, the 
overall sample mean can be computed by using equation (13.5). For the data in Table 13.1 
we obtained the following result. 


$$\bar{\bar{x}} = \frac{62 + 66 + 52}{3} = 60$$

If the null hypothesis is true ($\mu_1 = \mu_2 = \mu_3 = \mu$), the overall sample mean of 60 is the best
estimate of the population mean $\mu$.
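
The difference between equations (13.3) and (13.5) is easy to see numerically. In the sketch below the function names are illustrative, and the unequal-size data set is a hypothetical variation (the last two method C observations are dropped) used only to show that the shortcut in equation (13.5) requires equal sample sizes.

```python
def overall_mean(samples):
    """Equation (13.3): sum of all observations divided by n_T."""
    n_total = sum(len(sample) for sample in samples)
    return sum(sum(sample) for sample in samples) / n_total


def mean_of_sample_means(samples):
    """Equation (13.5): average of the k sample means (valid for equal sizes)."""
    return sum(sum(sample) / len(sample) for sample in samples) / len(samples)


chemitech = [
    [58, 64, 55, 66, 67],   # method A
    [58, 69, 71, 64, 68],   # method B
    [48, 57, 59, 47, 49],   # method C
]
print(overall_mean(chemitech), mean_of_sample_means(chemitech))

# Hypothetical unequal-size variation: only three observations for method C.
unequal = [chemitech[0], chemitech[1], chemitech[2][:3]]
print(overall_mean(unequal), mean_of_sample_means(unequal))
```

With equal sample sizes the two functions agree; with unequal sizes only `overall_mean` gives the overall sample mean defined in equation (13.3).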


Between-Treatments Estimate of Population Variance 


In the preceding section, we introduced the concept of a between-treatments estimate of $\sigma^2$
and showed how to compute it when the sample sizes were equal. This estimate of $\sigma^2$ is
called the mean square due to treatments and is denoted MSTR. The general formula for
computing MSTR is

$$\text{MSTR} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2}{k - 1} \tag{13.6}$$

The numerator in equation (13.6) is called the sum of squares due to treatments and
is denoted SSTR. The denominator, $k - 1$, represents the degrees of freedom associated
with SSTR. Hence, the mean square due to treatments can be computed using
the following formula.

MEAN SQUARE DUE TO TREATMENTS

$$\text{MSTR} = \frac{\text{SSTR}}{k - 1} \tag{13.7}$$

where

$$\text{SSTR} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2 \tag{13.8}$$


If $H_0$ is true, MSTR provides an unbiased estimate of $\sigma^2$. However, if the means of the k
populations are not equal, MSTR is not an unbiased estimate of $\sigma^2$; in fact, in that case,
MSTR should overestimate $\sigma^2$.

For the Chemitech data in Table 13.1, we obtain the following results.

$$\text{SSTR} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2 = 5(62 - 60)^2 + 5(66 - 60)^2 + 5(52 - 60)^2 = 520$$

$$\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{520}{2} = 260$$
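
Equations (13.7) and (13.8) translate directly into code. The following is a minimal sketch, assuming the raw observations are available as one list per treatment; the function name is illustrative, and the formula allows unequal sample sizes.

```python
def mean_square_treatments(samples):
    """Return (SSTR, MSTR) following equations (13.8) and (13.7)."""
    k = len(samples)
    n_total = sum(len(sample) for sample in samples)
    overall = sum(sum(sample) for sample in samples) / n_total
    # Each treatment contributes n_j times its squared distance from the overall mean.
    sstr = sum(len(sample) * (sum(sample) / len(sample) - overall) ** 2
               for sample in samples)
    return sstr, sstr / (k - 1)


chemitech = [
    [58, 64, 55, 66, 67],   # method A
    [58, 69, 71, 64, 68],   # method B
    [48, 57, 59, 47, 49],   # method C
]
sstr, mstr = mean_square_treatments(chemitech)
print(f"SSTR = {sstr:.0f}, MSTR = {mstr:.0f}")
```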


Within-Treatments Estimate of Population Variance 


Earlier, we introduced the concept of a within-treatments estimate of $\sigma^2$ and showed how to
compute it when the sample sizes were equal. This estimate of $\sigma^2$ is called the mean square
due to error and is denoted MSE. The general formula for computing MSE is

$$\text{MSE} = \frac{\sum_{j=1}^{k} (n_j - 1) s_j^2}{n_T - k} \tag{13.9}$$

The numerator in equation (13.9) is called the sum of squares due to error and is denoted
SSE. The denominator of MSE is referred to as the degrees of freedom associated with
SSE. Hence, the formula for MSE can also be stated as follows.


MEAN SQUARE DUE TO ERROR

$$\text{MSE} = \frac{\text{SSE}}{n_T - k} \tag{13.10}$$

where

$$\text{SSE} = \sum_{j=1}^{k} (n_j - 1) s_j^2 \tag{13.11}$$


Note that MSE is based on the variation within each of the treatments; it is not influenced by
whether the null hypothesis is true. Thus, MSE always provides an unbiased estimate of $\sigma^2$.
For the Chemitech data in Table 13.1 we obtain the following results.

$$\text{SSE} = \sum_{j=1}^{k} (n_j - 1) s_j^2 = (5 - 1)27.5 + (5 - 1)26.5 + (5 - 1)31 = 340$$

$$\text{MSE} = \frac{\text{SSE}}{n_T - k} = \frac{340}{15 - 3} = \frac{340}{12} = 28.33$$
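
A corresponding sketch for equations (13.10) and (13.11) is shown below. It assumes the raw observations are available and uses `statistics.variance`, which applies the $n_j - 1$ divisor of equation (13.2); the function name is illustrative.

```python
from statistics import variance


def mean_square_error(samples):
    """Return (SSE, MSE) following equations (13.11) and (13.10)."""
    k = len(samples)
    n_total = sum(len(sample) for sample in samples)
    # (n_j - 1) * s_j^2 is the sum of squared deviations within treatment j.
    sse = sum((len(sample) - 1) * variance(sample) for sample in samples)
    return sse, sse / (n_total - k)


chemitech = [
    [58, 64, 55, 66, 67],   # method A
    [58, 69, 71, 64, 68],   # method B
    [48, 57, 59, 47, 49],   # method C
]
sse, mse = mean_square_error(chemitech)
print(f"SSE = {sse:.0f}, MSE = {mse:.2f}")
```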


Comparing the Variance Estimates: The F Test


If the null hypothesis is true, MSTR and MSE provide two independent, unbiased estimates
of $\sigma^2$. Based on the material covered in Chapter 11 we know that for normal populations,
the sampling distribution of the ratio of two independent estimates of $\sigma^2$ follows an F
distribution. Hence, if the null hypothesis is true and the ANOVA assumptions are valid,
the sampling distribution of MSTR/MSE is an F distribution with numerator degrees of
freedom equal to $k - 1$ and denominator degrees of freedom equal to $n_T - k$. In other
words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been
selected from this F distribution.

An introduction to the F distribution and the use of the F distribution table were
presented in Section 11.2.

However, if the null hypothesis is false, the value of MSTR/MSE will be inflated because
MSTR overestimates $\sigma^2$. Hence, we will reject $H_0$ if the resulting value of MSTR/MSE
appears to be too large to have been selected from an F distribution with $k - 1$ numerator
degrees of freedom and $n_T - k$ denominator degrees of freedom. Because the decision to
reject $H_0$ is based on the value of MSTR/MSE, the test statistic used to test for the equality
of k population means is as follows.


TEST STATISTIC FOR THE EQUALITY OF k POPULATION MEANS

$$F = \frac{\text{MSTR}}{\text{MSE}} \tag{13.12}$$

The test statistic follows an F distribution with $k - 1$ degrees of freedom in the numerator
and $n_T - k$ degrees of freedom in the denominator.


Let us return to the Chemitech experiment and use a level of significance $\alpha = .05$ to
conduct the hypothesis test. The value of the test statistic is

$$F = \frac{\text{MSTR}}{\text{MSE}} = \frac{260}{28.33} = 9.18$$

The numerator degrees of freedom is $k - 1 = 3 - 1 = 2$ and the denominator degrees
of freedom is $n_T - k = 15 - 3 = 12$. Because we will only reject the null hypothesis
for large values of the test statistic, the p-value is the upper tail area of the F distribution
to the right of the test statistic F = 9.18. Figure 13.4 shows the sampling distribution of
F = MSTR/MSE, the value of the test statistic, and the upper tail area that is the p-value
for the hypothesis test.


FIGURE 13.4 Computation of p-Value Using the Sampling Distribution of MSTR/MSE

[Figure: sampling distribution of MSTR/MSE, with the p-value shown as the upper tail area
to the right of F = 9.18.]

From Table 4 of Appendix B we find the following areas in the upper tail of an F
distribution with 2 numerator degrees of freedom and 12 denominator degrees of freedom.


Area in Upper Tail                   .10     .05     .025    .01
F Value (df_1 = 2, df_2 = 12)       2.81    3.89    5.10    6.93

F = 9.18 lies to the right of 6.93.


Because F = 9.18 is greater than 6.93, the area in the upper tail at F = 9.18 is less than
.01. Thus, the p-value is less than .01. Excel can be used to show that the p-value is .004.
With p-value $\le \alpha = .05$, $H_0$ is rejected. The test provides sufficient evidence to conclude
that the means of the three populations are not equal. In other words, analysis of variance
supports the conclusion that the population mean number of units produced per week for
the three assembly methods are not equal.
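
The p-value can also be checked without Excel. The sketch below relies on a standard closed form that holds only when the numerator degrees of freedom equal 2 (that is, k = 3 treatments): the upper tail area of an F distribution with 2 and $d$ degrees of freedom at the value $f$ is $(1 + 2f/d)^{-d/2}$. The function name is an illustrative assumption, and the shortcut does not apply to other numerator degrees of freedom.

```python
def f_upper_tail_two_numerator_df(f_value, denominator_df):
    """P(F > f_value) for an F distribution with 2 numerator degrees of freedom.

    Closed form valid ONLY when the numerator degrees of freedom equal 2.
    """
    return (1 + 2 * f_value / denominator_df) ** (-denominator_df / 2)


mstr = 260
mse = 340 / 12                      # unrounded value of 28.33
f_statistic = mstr / mse
p_value = f_upper_tail_two_numerator_df(f_statistic, denominator_df=12)

alpha = 0.05
print(f"F = {f_statistic:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```

For other degrees of freedom a statistical library is the practical route; for example, SciPy provides `scipy.stats.f.sf(x, dfn, dfd)` for the upper tail area of an F distribution.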

As with other hypothesis testing procedures, the critical value approach may also be
used. With $\alpha = .05$, the critical F value occurs with an area of .05 in the upper tail of an
F distribution with 2 and 12 degrees of freedom. From the F distribution table, we find
$F_{.05} = 3.89$. Hence, the appropriate upper tail rejection rule for the Chemitech experiment is


Reject $H_0$ if $F \ge 3.89$


With F = 9.18, we reject $H_0$ and conclude that the means of the three populations are
not equal. A summary of the overall procedure for testing for the equality of k population
means follows.


TEST FOR THE EQUALITY OF k POPULATION MEANS


$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$
$H_a$: Not all population means are equal

TEST STATISTIC

$$F = \frac{\text{MSTR}}{\text{MSE}}$$

REJECTION RULE

p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $F \ge F_\alpha$

where the value of $F_\alpha$ is based on an F distribution with $k - 1$ numerator degrees of
freedom and $n_T - k$ denominator degrees of freedom.
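
The two rejection rules always lead to the same decision, which the following sketch illustrates for the Chemitech values. It is restricted to k = 3 treatments because it uses closed forms that hold only for 2 numerator degrees of freedom: the upper tail area $(1 + 2f/d)^{-d/2}$ and its inverse $F_\alpha = (d/2)(\alpha^{-2/d} - 1)$. All function names are illustrative assumptions.

```python
def upper_tail_area(f_value, denominator_df):
    """P(F > f_value); valid only for 2 numerator degrees of freedom."""
    return (1 + 2 * f_value / denominator_df) ** (-denominator_df / 2)


def critical_value(alpha, denominator_df):
    """F value with upper tail area alpha; valid only for 2 numerator df."""
    return (denominator_df / 2) * (alpha ** (-2 / denominator_df) - 1)


def equal_means_test(mstr, mse, denominator_df, alpha):
    """Apply both rejection rules for a test with k = 3 treatments."""
    f_statistic = mstr / mse
    p_value = upper_tail_area(f_statistic, denominator_df)
    f_alpha = critical_value(alpha, denominator_df)
    return {
        "F": f_statistic,
        "p_value": p_value,
        "F_alpha": f_alpha,
        "reject_by_p_value": p_value <= alpha,
        "reject_by_critical_value": f_statistic >= f_alpha,
    }


result = equal_means_test(mstr=260, mse=340 / 12, denominator_df=12, alpha=0.05)
print(result)
```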


ANOVA Table 


The results of the preceding calculations can be displayed conveniently in a table referred 
to as the analysis of variance or ANOVA table. The general form of the ANOVA table for 
a completely randomized design is shown in Table 13.2; Table 13.3 is the corresponding 
ANOVA table for the Chemitech experiment. The sum of squares associated with the 
source of variation referred to as “Total” is called the total sum of squares (SST). Note that 
the results for the Chemitech experiment suggest that SST = SSTR + SSE, and that the 
degrees of freedom associated with this total sum of squares is the sum of the degrees of 
freedom associated with the sum of squares due to treatments and the sum of squares due 
to error. 

Analysis of variance can be thought of as a statistical procedure for partitioning
the total sum of squares into separate components.

TABLE 13.2 ANOVA Table for a Completely Randomized Design


Source          Sum           Degrees        Mean
of Variation    of Squares    of Freedom     Square                   F           p-Value
Treatments      SSTR          k - 1          MSTR = SSTR/(k - 1)      MSTR/MSE
Error           SSE           n_T - k        MSE = SSE/(n_T - k)
Total           SST           n_T - 1


TABLE 13.3 Analysis of Variance Table for the Chemitech Experiment


Source          Sum           Degrees        Mean
of Variation    of Squares    of Freedom     Square       F        p-Value
Treatments      520           2              260.00       9.18     .004
Error           340           12             28.33
Total           860           14


We point out that SST divided by its degrees of freedom $n_T - 1$ is nothing more than the
overall sample variance that would be obtained if we treated the entire set of 15 observations
as one data set. With the entire data set as one sample, the formula for computing the total
sum of squares, SST, is

$$\text{SST} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{\bar{x}})^2 \tag{13.13}$$

It can be shown that the results we observed for the analysis of variance table for the
Chemitech experiment also apply to other problems. That is,

$$\text{SST} = \text{SSTR} + \text{SSE} \tag{13.14}$$

In other words, SST can be partitioned into two sums of squares: the sum of squares
due to treatments and the sum of squares due to error. Note also that the degrees of
freedom corresponding to SST, $n_T - 1$, can be partitioned into the degrees of freedom
corresponding to SSTR, $k - 1$, and the degrees of freedom corresponding to SSE,
$n_T - k$. The analysis of variance can be viewed as the process of partitioning the total
sum of squares and the degrees of freedom into their corresponding sources: treatments
and error. Dividing the sum of squares by the appropriate degrees of freedom provides
the variance estimates, the F value, and the p-value used to test the hypothesis of equal
population means.
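
The partitioning in equation (13.14) can be verified numerically for the Chemitech data. The sketch below computes SST directly from equation (13.13), computes SSTR and SSE separately, and confirms that both the sums of squares and the degrees of freedom add up; the variable names are illustrative.

```python
from statistics import variance

samples = {
    "A": [58, 64, 55, 66, 67],
    "B": [58, 69, 71, 64, 68],
    "C": [48, 57, 59, 47, 49],
}
observations = [x for sample in samples.values() for x in sample]
n_total, k = len(observations), len(samples)
overall = sum(observations) / n_total

# Equation (13.13): squared deviations of every observation from the overall mean.
sst = sum((x - overall) ** 2 for x in observations)
# Equation (13.8): treatment means around the overall mean, weighted by n_j.
sstr = sum(len(s) * (sum(s) / len(s) - overall) ** 2 for s in samples.values())
# Squared deviations of each observation from its own treatment mean.
sse = sum((x - sum(s) / len(s)) ** 2 for s in samples.values() for x in s)

rows = [("Treatments", sstr, k - 1), ("Error", sse, n_total - k), ("Total", sst, n_total - 1)]
for source, sum_of_squares, df in rows:
    print(f"{source:<12}{sum_of_squares:>8.0f}{df:>6}")

# SST / (n_T - 1) is the sample variance of all 15 observations taken together.
print(f"SST/(n_T - 1) = {sst / (n_total - 1):.2f}, variance = {variance(observations):.2f}")
```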


Excel’s Anova: Single Factor tool can be used to conduct a hypothesis test about the differ- 
ence between the population means for the Chemitech experiment. 


Enter/Access Data: Open the file Chemitech. The data are in cells A2:C6 and labels are 
in cells A1:C1. 


Apply Tools: The following steps describe how to use Excel’s Anova: Single Factor tool 
to test the hypothesis that the mean number of units produced per week is the same for all 
three methods of assembly. 
FIGURE 13.5 Excel’s ANOVA: Single Factor Tool Dialog Box for the
Chemitech Experiment

[Figure: worksheet with the labels Method A, Method B, and Method C in cells A1:C1 and the
Chemitech data in cells A2:C6, together with the Anova: Single Factor dialog box showing
Input Range A1:C6 and Grouped By: Columns.]


Step 1. Click the Data tab on the Ribbon 
Step 2. In the Analyze group, click Data Analysis 
Step 3. Choose Anova: Single Factor from the list of Analysis Tools 
Step 4. When the Anova: Single Factor dialog box appears (see Figure 13.5): 
Enter A1:C6 in the Input Range: box
Select Columns in the Grouped By: area 
Select the check box for Labels in First Row 
Enter .05 in the Alpha: box 
Select Output Range: in the Output options area 
Enter A8 in the Output Range: box (to identify the upper left corner of 
the section of the worksheet where the output will appear) 
Click OK 


The output, titled Anova: Single Factor, appears in cells A8:G22 of the worksheet shown 
in Figure 13.6. Cells A10:E14 provide a summary of the data. Note that the sample mean 
and sample variance for each method of assembly are the same as shown in Table 13.1. The 
ANOVA table, shown in cells A17:G22, is basically the same as the ANOVA table shown 
in Table 13.2. Excel identifies the treatments source of variation using the label Between 
Groups and the error source of variation using the label Within Groups. In addition, the 
Excel output provides the p-value associated with the test as well as the critical F value. 

We can use the p-value shown in cell F19, 0.0038, to make the hypothesis testing
decision. Thus, at the $\alpha = .05$ level of significance, we reject $H_0$ because the p-value =
0.0038 < $\alpha = .05$. Hence, using the p-value approach we still conclude that the mean
number of units produced per week are not the same for the three assembly methods.


Testing for the Equality of k Population Means: 
An Observational Study 


We have shown how analysis of variance can be used to test for the equality of k population 
means for a completely randomized experimental design. It is important to understand that 
ANOVA can also be used to test for the equality of three or more population means using 
data obtained from an observational study. As an example, let us consider the situation at 
National Computer Products, Inc. (NCP). 

FIGURE 13.6 Excel’s ANOVA: Single Factor Tool Output for the
Chemitech Experiment

[Figure: worksheet output not reproduced.]

NCP manufactures printers and fax machines at plants located in Atlanta, Dallas,
and Seattle. To measure how much employees at these plants know about quality
management, a random sample of 6 employees was selected from each plant and the
employees selected were given a quality awareness examination. The examination scores
for these 18 employees are shown in Table 13.4. The sample means, sample variances,
and sample standard deviations for each group are also provided. Managers want to
use these data to test the hypothesis that the mean examination score is the same for all
three plants.


TABLE 13.4 Examination Scores for 18 Employees (DATA file: NCP)


                              Plant 1    Plant 2    Plant 3
                              Atlanta    Dallas     Seattle
                                85         71         59
                                75         75         64
                                82         73         62
                                76         74         69
                                71         69         75
                                85         82         67
Sample mean                     79         74         66
Sample variance                 34         20         32
Sample standard deviation     5.83       4.47       5.66

We define population 1 as all employees at the Atlanta plant, population 2 as all 
employees at the Dallas plant, and population 3 as all employees at the Seattle plant. Let 


$\mu_1$ = mean examination score for population 1
$\mu_2$ = mean examination score for population 2
$\mu_3$ = mean examination score for population 3


Although we will never know the actual values of $\mu_1$, $\mu_2$, and $\mu_3$, we want to use the
sample results to test the following hypotheses.


$H_0$: $\mu_1 = \mu_2 = \mu_3$
$H_a$: Not all population means are equal


Note that the hypothesis test for the NCP observational study is exactly the same as the 
hypothesis test for the Chemitech experiment. Indeed, the same analysis of variance meth- 
odology we used to analyze the Chemitech experiment can also be used to analyze the data 
from the NCP observational study. 

Even though the same ANOVA methodology is used for the analysis, it is worth noting 
how the NCP observational statistical study differs from the Chemitech experimental 
statistical study. The individuals who conducted the NCP study had no control over how 
the plants were assigned to individual employees. That is, the plants were already in oper- 
ation and a particular employee worked at one of the three plants. All that NCP could do 
was to select a random sample of 6 employees from each plant and administer the quality 
awareness examination. To be classified as an experimental study, NCP would have had to 
be able to randomly select 18 employees and then assign the plants to each employee in a 
random fashion. 
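
Because the ANOVA computations do not depend on how the data were obtained, the same calculations can be applied to the NCP scores. The sketch below is an illustration rather than the text's own solution: it uses the 18 scores from Table 13.4, the function name is an assumption, and the p-value formula is the closed form that is valid only for k = 3 groups (2 numerator degrees of freedom).

```python
def one_way_anova_three_groups(samples):
    """One-way ANOVA for exactly three groups; returns SSTR, SSE, F, and p-value."""
    if len(samples) != 3:
        raise ValueError("the closed-form p-value below requires exactly 3 groups")
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    overall = sum(sum(s) for s in samples) / n_total
    sstr = sum(len(s) * (sum(s) / len(s) - overall) ** 2 for s in samples)
    sse = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
    mstr, mse = sstr / (k - 1), sse / (n_total - k)
    f_statistic = mstr / mse
    df_error = n_total - k
    # Upper tail area of an F distribution with 2 and df_error degrees of freedom.
    p_value = (1 + 2 * f_statistic / df_error) ** (-df_error / 2)
    return {"SSTR": sstr, "SSE": sse, "F": f_statistic, "p_value": p_value}


ncp_scores = [
    [85, 75, 82, 76, 71, 85],   # Atlanta
    [71, 75, 73, 74, 69, 82],   # Dallas
    [59, 64, 62, 69, 75, 67],   # Seattle
]
ncp = one_way_anova_three_groups(ncp_scores)
print(f"SSTR = {ncp['SSTR']:.0f}, SSE = {ncp['SSE']:.0f}, "
      f"F = {ncp['F']:.2f}, p-value = {ncp['p_value']:.4f}")
```

Under these assumptions the p-value falls below $\alpha = .05$, so the hypothesis of equal mean examination scores at the three plants would be rejected. Because the study is observational, this result indicates a difference among the plants but does not by itself explain what causes it.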


NOTES + COMMENTS


1. The overall sample mean can also be computed as a weighted average of the k sample means.

$$\bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_T}$$

   In problems where the sample means are provided, this formula is simpler than
   equation (13.3) for computing the overall mean.

2. If each sample consists of n observations, equation (13.6) can be written as

$$\text{MSTR} = \frac{n \sum_{j=1}^{k} (\bar{x}_j - \bar{\bar{x}})^2}{k - 1} = n \left[ \frac{\sum_{j=1}^{k} (\bar{x}_j - \bar{\bar{x}})^2}{k - 1} \right] = n s_{\bar{x}}^2$$

   Note that this result is the same as presented in Section 13.1 when we introduced the
   concept of the between-treatments estimate of $\sigma^2$. Equation (13.6) is simply a
   generalization of this result to the unequal sample-size case.

3. If each sample has n observations, $n_T = kn$; thus, $n_T - k = k(n - 1)$, and equation (13.9)
   can be rewritten as

$$\text{MSE} = \frac{\sum_{j=1}^{k} (n - 1) s_j^2}{k(n - 1)} = \frac{(n - 1) \sum_{j=1}^{k} s_j^2}{k(n - 1)} = \frac{\sum_{j=1}^{k} s_j^2}{k}$$

   In other words, if the sample sizes are the same, MSE is the average of the k sample
   variances. Note that it is the same result we used in Section 13.1 when we introduced
   the concept of the within-treatments estimate of $\sigma^2$.

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EXERCISES 


Methods 
1. The following data are from a completely randomized design. 


                      Treatment
                    A        B        C
                   162      142      126
                   142      156      122
                   165      124      138
                   145      142      140
                   148      136      150
                   174      152      128
Sample mean        156      142      134
Sample variance   164.4    131.2    110.4


a. Compute the sum of squares between treatments.
b. Compute the mean square between treatments.
c. Compute the sum of squares due to error.
d. Compute the mean square due to error.
e. Set up the ANOVA table for this problem.
f. At the α = .05 level of significance, test whether the means for the three treatments
are equal.

2. In a completely randomized design, seven experimental units were used for each of the 
five levels of the factor. Complete the following ANOVA table. 


Source          Sum           Degrees        Mean
of Variation    of Squares    of Freedom     Square    F    p-Value
Treatments      300
Error
Total           460


3. Refer to exercise 2.
a. What hypotheses are implied in this problem?
b. At the α = .05 level of significance, can we reject the null hypothesis in part (a)?
Explain.

4. In an experiment designed to test the output levels of three different treatments,
the following results were obtained: SST = 400, SSTR = 150, $n_T$ = 19. Set up the
ANOVA table and test for any significant difference between the mean output levels of
the three treatments. Use α = .05.

5. In a completely randomized design, 12 experimental units were used for the first
treatment, 15 for the second treatment, and 20 for the third treatment. Complete the
following analysis of variance. At a .05 level of significance, is there a significant
difference between the treatments?


Source          Sum           Degrees        Mean
of Variation    of Squares    of Freedom     Square    F    p-Value
Treatments      1200
Error
Total           1800

6. Develop the analysis of variance computations for the following completely randomized
design. At α = .05, is there a significant difference between the treatment means?


(DATA file: Exer6)

                 Treatment
               A         B         C
              136       107        92
              120       114        82
              113       125        85
              107       104       101
              131       107        89
              114       109       117
              129        97       110
              102       114       120
                        104        98
                         89       106
$\bar{x}_j$   119       107       100
$s_j^2$     146.86     96.44    173.78


Applications 

7. Product Assembly. Three different methods for assembling a product were proposed 
by an industrial engineer. To investigate the number of units assembled correctly 
with each method, 30 employees were randomly selected and randomly assigned to 
the three proposed methods in such a way that each method was used by 10 workers. 
The number of units assembled correctly was recorded, and the analysis of variance 
procedure was applied to the resulting data set. The following results were obtained: 
SST = 10,800; SSTR = 4560. 

a. Set up the ANOVA table for this problem. 
b. Use α = .05 to test for any significant difference in the means for the three assembly
methods.

8. Testing Quality Awareness. Refer to the NCP data in Table 13.4. Set up the ANOVA
table and test for any significant difference in the mean examination score for the three
plants. Use α = .05.

9. Testing Temperature on a Chemical Process. To study the effect of temperature on 
yield in a chemical process, five batches were produced at each of three temperature 
levels. The results follow. Construct an analysis of variance table. Use a .05 level of 
significance to test whether the temperature level has an effect on the mean yield of 
the process. 


        Temperature
50°C      60°C      70°C
 34        30        23
 24        31        28
 36        34        28
 39        23        30
 22        27        31


10. Auditing Errors. Auditors must make judgments about various aspects of an audit on 
the basis of their own direct experience, indirect experience, or a combination of the 
two. In a study, auditors were asked to make judgments about the frequency of errors 
to be found in an audit. The judgments by the auditors were then compared to the 
actual results. Suppose the following data were obtained from a similar study; lower 
scores indicate better judgments.

(DATA file: AudJudg)

Direct    Indirect    Combination
 17.0       16.6         25.2
 18.5       22.2         24.0
 15.8       20.5         21.5
 18.2       18.3         26.8
 20.2       24.2         27.5
 16.0       19.8         25.8
 13.3       21.2         24.2

Use α = .05 to test to see whether the basis for the judgment affects the quality of the
judgment. What is your conclusion?

11. Paint-Drying Robots. How long it takes paint to dry can have an impact on the 
production capacity of a business. In May 2018, Deal’s Auto Body & Paint in Prescott, 
Arizona, invested in a paint-drying robot to speed up its process (The Daily Courier 
website, https://www.dcourier.com/photos/2018/may/26/984960336/). An interesting 
question is, “Do all paint-drying robots have the same drying time?” To test this, suppose
we sample five drying times for each of four different brands of paint-drying robots.
The time in minutes until the paint was dry enough for a second coat to be applied was
recorded. The following data were obtained.

(DATA file: Paint)

Robot 1    Robot 2    Robot 3    Robot 4
  128        144        133        150
  137        133        143        142
  135        142        137        135
  124        146        136        140
  141        130        131        153

At the α = .05 level of significance, test to see whether the mean drying time is the
same for each brand of robot.

12. Restaurant Satisfaction. The Consumer Reports Restaurant Customer Satisfaction 
Survey is based upon 148,599 visits to full-service restaurant chains (Consumer 
Reports website). One of the variables in the study is meal price, the average amount 
paid per person for dinner and drinks, minus the tip. Suppose a reporter for the Sun 
Coast Times thought that it would be of interest to her readers to conduct a similar 
study for restaurants located on the Grand Strand section in Myrtle Beach, South 
Carolina. The reporter selected a sample of 8 seafood restaurants, 8 Italian restau- 
rants, and 8 steakhouses. The following data show the meal prices ($) obtained for the 
24 restaurants sampled. Use α = .05 to test whether there is a significant difference
among the mean meal price for the three types of restaurants.

(DATA file: GrandStrand)

Italian    Seafood    Steakhouse
  $12        $16         $24
   13         18          19
   15         17          23
   17         26          25
   18         23          21
   20         15          22
   17         19          27
   24         18          31


13.3 Multiple Comparison Procedures


When we use analysis of variance to test whether the means of k populations are equal, 
rejection of the null hypothesis allows us to conclude only that the population means are not 
all equal. In some cases we will want to go a step further and determine where the differ- 
ences among means occur. The purpose of this section is to show how multiple comparison 
procedures can be used to conduct statistical comparisons between pairs of population means. 


Fisher’s LSD 


Suppose that analysis of variance provides statistical evidence to reject the null hypothesis
of equal population means. In this case, Fisher’s least significant difference (LSD) procedure
can be used to determine where the differences occur. To illustrate the use of Fisher’s
LSD procedure in making pairwise comparisons of population means, recall the Chemitech
experiment introduced in Section 13.1. Using analysis of variance, we concluded that the
mean number of units produced per week is not the same for the three assembly methods.
In this case, the follow-up question is: We believe the assembly methods differ, but where
do the differences occur? That is, do the means of populations 1 and 2 differ? Or those of
populations 1 and 3? Or those of populations 2 and 3? The following table summarizes
Fisher’s LSD procedure for comparing pairs of population means.


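FISHER’S LSD PROCEDURE

$H_0$: $\mu_i = \mu_j$
$H_a$: $\mu_i \neq \mu_j$

TEST STATISTIC

$$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \tag{13.16}$$

REJECTION RULE

p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$

where the value of $t_{\alpha/2}$ is based on a t distribution with $n_T - k$ degrees of freedom.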


Let us now apply this procedure to determine whether there is a significant difference
between the means of population 1 (method A) and population 2 (method B) at the α = .05
level of significance. Table 13.1 showed that the sample mean is 62 for method A and 66 for
method B. Table 13.3 showed that the value of MSE is 28.33; it is the estimate of $\sigma^2$ and is
based on 12 degrees of freedom. For the Chemitech data the value of the test statistic is

$$t = \frac{62 - 66}{\sqrt{28.33\left(\dfrac{1}{5} + \dfrac{1}{5}\right)}} = -1.19$$


Because we have a two-tailed test, the p-value is two times the area under the curve for
the t distribution to the left of t = −1.19. Using Table 2 in Appendix B, the t distribution
table for 12 degrees of freedom provides the following information.


Area in Upper Tail     .20     .10     .05     .025    .01     .005
t Value (12 df)        .873   1.356   1.782   2.179   2.681   3.055

(t = 1.19 falls between .873 and 1.356.)

The t distribution table only contains positive t values. Because the t distribution is symmetric,
however, we can find the area under the curve to the right of t = 1.19 and double it to find the
p-value corresponding to t = −1.19. Because t = 1.19 lies between the table values .873 and
1.356, the area in the upper tail is between .20 and .10. Doubling
these amounts, we see that the p-value must be between .40 and .20. Excel can be used to
show that the p-value is .2571. Because the p-value is greater than α = .05, we cannot reject
the null hypothesis. Hence, we cannot conclude that the population mean number of units
produced per week for method A is different from the population mean for method B.
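
The test above can be checked with a short script. The following is a minimal sketch that uses only the Python standard library; the function names are illustrative and not part of the text, and the p-value is obtained by numerically integrating the t density (Simpson's rule) rather than by the Excel function the text refers to.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, integrating the density on [0, |t|] with Simpson's rule.

    steps must be even for Simpson's rule.
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return 1 - central_area

def lsd_t_statistic(mean_i, mean_j, mse, n_i, n_j):
    """Test statistic of equation (13.16)."""
    return (mean_i - mean_j) / math.sqrt(mse * (1 / n_i + 1 / n_j))

# Chemitech values quoted in the text: means 62 and 66, MSE = 28.33, n = 5 each.
t_stat = lsd_t_statistic(62, 66, 28.33, 5, 5)
p_value = two_tailed_p(t_stat, df=12)
print(round(t_stat, 2), round(p_value, 2))
```

Because the script carries the unrounded test statistic into the p-value calculation, its result can differ from the reported .2571 in the third or fourth decimal place; the conclusion (p-value greater than .05) is unaffected.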

Many practitioners find it easier to determine how large the difference between the
sample means must be to reject $H_0$. In this case the test statistic is $\bar{x}_i - \bar{x}_j$, and the test is
conducted by the following procedure.


FISHER’S LSD PROCEDURE BASED ON THE TEST STATISTIC $\bar{x}_i - \bar{x}_j$

$H_0$: $\mu_i = \mu_j$
$H_a$: $\mu_i \neq \mu_j$

TEST STATISTIC

$$\bar{x}_i - \bar{x}_j$$

REJECTION RULE AT A LEVEL OF SIGNIFICANCE α

Reject $H_0$ if $|\bar{x}_i - \bar{x}_j| \ge \text{LSD}$

where

$$\text{LSD} = t_{\alpha/2}\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \tag{13.17}$$


For the Chemitech experiment the value of LSD is

$$\text{LSD} = 2.179\sqrt{28.33\left(\frac{1}{5} + \frac{1}{5}\right)} = 7.34$$


Note that when the sample sizes are equal, only one value for LSD is computed. In such 
cases we can simply compare the magnitude of the difference between any two sample 
means with the value of LSD. For example, the difference between the sample means for 
population 1 (method A) and population 3 (method C) is 62 — 52 = 10. This difference is 
greater than LSD = 7.34, which means we can reject the null hypothesis that the popu- 
lation mean number of units produced per week for method A is equal to the population 
mean for method C. Similarly, with the difference between the sample means for popula- 
tions 2 and 3 of 66 — 52 = 14 > 7.34, we can also reject the hypothesis that the population 
mean for method B is equal to the population mean for method C. In effect, our conclusion 
is that methods A and B both differ from method C. 
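
The pairwise comparisons above can be reproduced with a few lines of code. This is a minimal sketch using the values quoted in the text (critical value 2.179, MSE = 28.33, five observations per method); the helper name `fisher_lsd` is illustrative.

```python
import math
from itertools import combinations

def fisher_lsd(t_crit, mse, n_i, n_j):
    """Least significant difference, equation (13.17)."""
    return t_crit * math.sqrt(mse * (1 / n_i + 1 / n_j))

# Chemitech sample means for the three assembly methods, as quoted in the text.
means = {"A": 62, "B": 66, "C": 52}
n = 5  # equal sample sizes, so a single LSD value applies to every pair

lsd = fisher_lsd(t_crit=2.179, mse=28.33, n_i=n, n_j=n)

# A pair differs significantly when |mean_i - mean_j| >= LSD.
significant = {
    (a, b): abs(means[a] - means[b]) >= lsd
    for a, b in combinations(means, 2)
}

print(round(lsd, 2))
for pair, flag in significant.items():
    print(pair, abs(means[pair[0]] - means[pair[1]]), flag)
```

With unequal sample sizes the LSD value would have to be recomputed for each pair, since the term 1/n_i + 1/n_j changes.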

Fisher’s LSD can also be used to develop a confidence interval estimate of the difference 
between the means of two populations. The general procedure follows. 


CONFIDENCE INTERVAL ESTIMATE OF THE DIFFERENCE BETWEEN TWO POPULATION
MEANS USING FISHER’S LSD PROCEDURE

$$\bar{x}_i - \bar{x}_j \pm \text{LSD} \tag{13.18}$$

where

$$\text{LSD} = t_{\alpha/2}\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \tag{13.19}$$

and $t_{\alpha/2}$ is based on a t distribution with $n_T - k$ degrees of freedom.

If the confidence interval in expression (13.18) includes the value zero, we cannot reject
the hypothesis that the two population means are equal. However, if the confidence
interval does not include the value zero, we conclude that there is a difference between the
population means. For the Chemitech experiment, recall that LSD = 7.34 (corresponding
to $t_{.025}$ = 2.179). Thus, a 95% confidence interval estimate of the difference between the
means of populations 1 and 2 is 62 − 66 ± 7.34 = −4 ± 7.34 = −11.34 to 3.34; because
this interval includes zero, we cannot reject the hypothesis that the two population means
are equal.
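
The interval in expression (13.18) is simple to compute once LSD is known. A minimal sketch, using the Chemitech values from the text (the function name is illustrative):

```python
def lsd_interval(mean_i, mean_j, lsd):
    """Confidence interval for mu_i - mu_j: (mean_i - mean_j) +/- LSD."""
    diff = mean_i - mean_j
    return diff - lsd, diff + lsd

# Methods A and B: sample means 62 and 66, LSD = 7.34.
lower, upper = lsd_interval(62, 66, 7.34)

# If zero lies inside the interval, equality of the two means cannot be rejected.
contains_zero = lower <= 0 <= upper
print(round(lower, 2), round(upper, 2), contains_zero)
```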


Type I Error Rates


We began the discussion of Fisher’s LSD procedure with the premise that analysis of variance
gave us statistical evidence to reject the null hypothesis of equal population means.
We showed how Fisher’s LSD procedure can be used in such cases to determine where 
the differences occur. Technically, it is referred to as a protected or restricted LSD test 
because it is employed only if we first find a significant F value by using analysis of 
variance. To see why this distinction is important in multiple comparison tests, we need to 
explain the difference between a comparisonwise Type I error rate and an experimentwise 
Type I error rate. 

In the Chemitech experiment we used Fisher’s LSD procedure to make three pairwise 
comparisons. 


Test 1                        Test 2                        Test 3
$H_0$: $\mu_1 = \mu_2$        $H_0$: $\mu_1 = \mu_3$        $H_0$: $\mu_2 = \mu_3$
$H_a$: $\mu_1 \neq \mu_2$     $H_a$: $\mu_1 \neq \mu_3$     $H_a$: $\mu_2 \neq \mu_3$


In each case, we used a level of significance of α = .05. Therefore, for each test, if the null
hypothesis is true, the probability that we will make a Type I error is α = .05; hence, the
probability that we will not make a Type I error on each test is 1 − .05 = .95. In discussing
multiple comparison procedures we refer to this probability of a Type I error (α = .05) as
the comparisonwise Type I error rate; comparisonwise Type I error rates indicate the
level of significance associated with a single pairwise comparison.

Let us now consider a slightly different question. What is the probability that in making
three pairwise comparisons, we will commit a Type I error on at least one of the three tests?
To answer this question, note that the probability that we will not make a Type I error on any
of the three tests is (.95)(.95)(.95) = .8574.¹ Therefore, the probability of making at least
one Type I error is 1 − .8574 = .1426. Thus, when we use Fisher’s LSD procedure to make
all three pairwise comparisons, the Type I error rate associated with this approach is not .05,
but actually .1426; we refer to this error rate as the overall or experimentwise Type I error
rate. To avoid confusion, we denote the experimentwise Type I error rate as $\alpha_{EW}$.

The experimentwise Type I error rate gets larger for problems with more populations.
For example, a problem with five populations has 10 possible pairwise comparisons. If we
tested all possible pairwise comparisons by using Fisher’s LSD with a comparisonwise
error rate of α = .05, the experimentwise Type I error rate would be $1 - (1 - .05)^{10} = .40$.
In such cases, practitioners look to alternatives that provide better control over the
experimentwise error rate.
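
The growth of the experimentwise error rate with the number of populations can be tabulated directly. The sketch below relies on the same simplifying assumption as the text, namely that the pairwise tests are treated as independent; the function name is illustrative.

```python
from math import comb

def experimentwise_rate(alpha, k):
    """Return (number of pairwise comparisons, experimentwise Type I error rate).

    Assumes the C = k(k - 1)/2 pairwise tests are independent, so the rate is
    1 - (1 - alpha)**C.
    """
    c = comb(k, 2)
    return c, 1 - (1 - alpha) ** c

for k in (3, 5):
    comparisons, rate = experimentwise_rate(0.05, k)
    print(k, comparisons, round(rate, 4))
```

For k = 3 this gives the .1426 found for the Chemitech comparisons, and for k = 5 it gives roughly .40 across the 10 comparisons.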

¹The assumption is that the three tests are independent, and hence the joint probability of the three events can be
obtained by simply multiplying the individual probabilities. In fact, the three tests are not independent because MSE
is used in each test; therefore, the error involved is even greater than that shown.

One alternative for controlling the overall experimentwise error rate, referred to as the
Bonferroni adjustment, involves using a smaller comparisonwise error rate for each test.
For example, if we want to test C pairwise comparisons and want the maximum probability
of making a Type I error for the overall experiment to be $\alpha_{EW}$, we simply use a
comparisonwise error rate equal to $\alpha_{EW}/C$. In the Chemitech experiment, if we want to use Fisher’s
LSD procedure to test all three pairwise comparisons with a maximum experimentwise
error rate of $\alpha_{EW}$ = .05, we set the comparisonwise error rate to be α = .05/3 = .017. For
a problem with five populations and 10 possible pairwise comparisons, the Bonferroni
adjustment would suggest a comparisonwise error rate of .05/10 = .005. Recall from our
discussion of hypothesis testing in Chapter 9 that for a fixed sample size, any decrease
in the probability of making a Type I error will result in an increase in the probability of
making a Type II error, which corresponds to accepting the hypothesis that the two
population means are equal when in fact they are not equal. As a result, many practitioners are
reluctant to perform individual tests with a low comparisonwise Type I error rate because
of the increased risk of making a Type II error.
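
The Bonferroni adjustment amounts to a single division. A minimal sketch follows (the function name is illustrative); it also prints the experimentwise rate that results under the independence simplification, to show that it stays at or below the target.

```python
def bonferroni_alpha(alpha_ew, num_comparisons):
    """Comparisonwise error rate that caps the experimentwise rate at alpha_ew."""
    return alpha_ew / num_comparisons

for c in (3, 10):  # 3 populations -> 3 comparisons; 5 populations -> 10
    alpha = bonferroni_alpha(0.05, c)
    # Experimentwise rate if the C tests were independent (a simplification).
    resulting_rate = 1 - (1 - alpha) ** c
    print(c, round(alpha, 3), round(resulting_rate, 4))
```

The smaller comparisonwise rate is what raises the Type II error risk described above: each individual test now requires a larger observed difference before the null hypothesis is rejected.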

Several other procedures, such as Tukey’s procedure and Duncan’s multiple range test, 
have been developed to help in such situations. However, there is considerable controversy 
in the statistical community as to which procedure is “best.” The truth is that no one proce- 
dure is best for all types of problems. 


EXERCISES 


Methods 
13. The following data are from a completely randomized design. 


                   Treatment    Treatment    Treatment
                       A            B            C
                      32           44           33
                      30           43           36
                      30           44           35
                      26           46           36
                      32           48           40
Sample mean           30           45           36
Sample variance      6.00         4.00         6.50


a. At the α = .05 level of significance, can we reject the null hypothesis that the
means of the three treatments are equal?

b. Use Fisher’s LSD procedure to test whether there is a significant difference between
the means for treatments A and B, treatments A and C, and treatments B and C. Use
α = .05.

c. Use Fisher’s LSD procedure to develop a 95% confidence interval estimate of the 
difference between the means of treatments A and B. 

14. The following data are from a completely randomized design. In the following
calculations, use α = .05.


               Treatment    Treatment    Treatment
                   1            2            3
                  63           82           69
                  47           72           54
                  54           88           61
                  40           66           48
$\bar{x}_j$       51           77           58
$s_j^2$         96.67        97.33        82.00


a. Use analysis of variance to test for a significant difference among the means of the
three treatments.
b. Use Fisher’s LSD procedure to determine which means are different.

Applications

15. Testing Chemical Processes. To test whether the mean time needed to mix a batch of 
material is the same for machines produced by three manufacturers, the Jacobs Chemical 
Company obtained the following data on the time (in minutes) needed to mix the material. 


Manufacturer 
1 2 3 
20 28 20 
26 26 19 
24 31 28 
22 2 22 


a. Use these data to test whether the population mean times for mixing a batch of
material differ for the three manufacturers. Use α = .05.

b. At the α = .05 level of significance, use Fisher’s LSD procedure to test for the
equality of the means for manufacturers 1 and 3. What conclusion can you draw
after carrying out this test?

16. Confidence Intervals for Different Processes. Refer to exercise 15. Use Fisher’s 
LSD procedure to develop a 95% confidence interval estimate of the difference be- 
tween the means for manufacturer 1 and manufacturer 2. 

17. Marketing Ethics. In the digital age of marketing, special care must be taken to ensure 
that programmatic ads appear on websites aligned with a company’s strategy, culture, 
and ethics. For example, in 2017, Nordstrom, Amazon, and Whole Foods each faced 
boycotts from social media users when automated ads for these companies showed up 
on the Breitbart website (ChiefMarketer.com website). It is important for marketing 
professionals to understand a company’s values and culture. The following data are from 
an experiment designed to investigate the perception of corporate ethical values among 
individuals specializing in marketing (higher scores indicate higher ethical values). 


Marketing Managers Marketing Research Advertising 
6 5 6 
5 5 7 
4 4 6 
5 4 5 
6 5 6 
4 4 6 


a. Use α = .05 to test for significant differences in perception among the three groups.
b. At the α = .05 level of significance, we can conclude that there are differences in the
perceptions for marketing managers, marketing research specialists, and advertising
specialists. Use the procedures in this section to determine where the differences occur.
Use α = .05.
18. Machine Breakdowns. To test for any significant difference in the number of hours 
between breakdowns for four machines, the following data were obtained. 


Machine 1    Machine 2    Machine 3    Machine 4
   6.4          8.7         11.1          9.9
   7.8          7.4         10.3         12.8
   5.3          9.4          9.7         12.1
   7.4         10.1         10.3         10.8
   8.4          9.2          9.2         11.3
   7.3          9.8          8.8         11.5

a. At the α = .05 level of significance, what is the difference, if any, in the population
mean times among the four machines?

b. Use Fisher’s LSD procedure to test for the equality of the means for machines
2 and 4. Use a .05 level of significance.

[DATA file for exercise 20: Triple-A]

[Margin note: A completely randomized design is useful when the experimental units
are homogeneous. If the experimental units are heterogeneous, blocking is often used
to form homogeneous groups.]

19. Testing Time to Breakdown Between All Pairs of Machines. Refer to exercise 18. 
Use the Bonferroni adjustment to test for a significant difference between all pairs of 
means. Assume that a maximum overall experimentwise error rate of .05 is desired. 

20. Minor League Baseball Attendance. The International League of Triple-A minor 
league baseball consists of 14 teams organized into three divisions: North, South, and 
West. The following data show the average attendance for the 14 teams in the Inter- 
national League (The Biz of Baseball website). Also shown are the teams’ records; 

W denotes the number of games won, L denotes the number of games lost, and PCT is 
the proportion of games played that were won. 


Team Name                           Division    W     L     PCT    Attendance
Buffalo Bisons                      North       66    77    .462      8812
Lehigh Valley IronPigs              North       55    89    .382      8479
Pawtucket Red Sox                   North       85    58    .594      9097
Rochester Red Wings                 North       74    70    .514      6913
Scranton-Wilkes Barre Yankees       North       88    56    .611      7147
Syracuse Chiefs                     North       69    73    .486      5765
Charlotte Knights                   South       63    78    .447      4526
Durham Bulls                        South       74    70    .514      6995
Norfolk Tides                       South       64    78    .451      6286
Richmond Braves                     South       63    78    .447      4455
Columbus Clippers                   West        69    73    .486      7795
Indianapolis Indians                West        68    76    .472      8538
Louisville Bats                     West        88    56    .611      9152
Toledo Mud Hens                     West        75    69    .521      8234


a. Use α = .05 to test for any difference in the mean attendance for the three
divisions.

b. Use Fisher’s LSD procedure to determine where the differences occur. Use
α = .05.


13.4 Randomized Block Design 


Thus far we have considered the completely randomized experimental design. Recall that 
to test for a difference among treatment means, we computed an F value by using the ratio

$$F = \frac{\text{MSTR}}{\text{MSE}} \tag{13.20}$$

A problem can arise whenever differences due to extraneous factors (ones not con- 
sidered in the experiment) cause the MSE term in this ratio to become large. In such cases, 
the F value in equation (13.20) can become small, signaling no difference among treatment 
means when in fact such a difference exists. 

In this section, we present an experimental design known as a randomized block design.
Its purpose is to control some of the extraneous sources of variation by removing such
variation from the MSE term. This design tends to provide a better estimate of the true error
variance and leads to a more powerful hypothesis test in terms of the ability to detect differences
among treatment means. To illustrate, let us consider a stress study for air traffic controllers.


Air Traffic Controller Stress Test


A study measuring the fatigue and stress of air traffic controllers resulted in proposals for 
modification and redesign of the controller’s workstation. After consideration of several 
designs for the workstation, three specific alternatives are selected as having the best 
potential for reducing controller stress. The key question is: To what extent do the three 
alternatives differ in terms of their effect on controller stress? To answer this question, we 
need to design an experiment that will provide measurements of air traffic controller stress 
under each alternative. 

In a completely randomized design, a random sample of controllers would be assigned
to each workstation alternative. However, controllers are believed to differ substantially in
their ability to handle stressful situations. What is high stress to one controller might be
only moderate or even low stress to another. Hence, when considering the within-group
source of variation (MSE), we must realize that this variation includes both random error
and error due to individual controller differences. In fact, managers expected controller
variability to be a major contributor to the MSE term.

[Margin note: Experimental studies in business often involve experimental units that are
highly heterogeneous; as a result, randomized block designs are often employed.]

One way to separate the effect of the individual differences is to use a randomized block
design. Such a design will identify the variability stemming from individual controller
differences and remove it from the MSE term. The randomized block design calls for a
single sample of controllers. Each controller in the sample is tested with each of the three
workstation alternatives. In experimental design terminology, the workstation is the factor
of interest and the controllers are the blocks. The three treatments or populations associated
with the workstation factor correspond to the three workstation alternatives. For simplicity,
we refer to the workstation alternatives as system A, system B, and system C.

[Margin note: Blocking in experimental design is similar to stratification in sampling.]

The randomized aspect of the randomized block design is the random order in which 
the treatments (systems) are assigned to the controllers. If every controller were to test the 
three systems in the same order, any observed difference in systems might be due to the 
order of the test rather than to true differences in the systems. 
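
The randomization step can be illustrated with a short script. This is a sketch only: the text does not say how the orders were actually generated, and the function name and seed value are hypothetical.

```python
import random

def assign_orders(blocks, treatments, seed=None):
    """Give each block (controller) its own random ordering of all treatments."""
    rng = random.Random(seed)
    # rng.sample with k equal to the list length returns a random permutation.
    return {block: rng.sample(treatments, k=len(treatments)) for block in blocks}

controllers = [f"Controller {i}" for i in range(1, 7)]
systems = ["System A", "System B", "System C"]

orders = assign_orders(controllers, systems, seed=13)  # seed fixed only for repeatability
for controller, order in orders.items():
    print(controller, "->", ", ".join(order))
```

Every controller still tests all three systems, which is what makes the design a complete block design; only the sequence varies from one controller to the next.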

To provide the necessary data, the three workstation alternatives were installed at the 
Cleveland Control Center in Oberlin, Ohio. Six controllers were selected at random and 
assigned to operate each of the systems. A follow-up interview and a medical examination 
of each controller participating in the study provided a measure of the stress for each con- 
troller on each system. The data are reported in Table 13.5. 

Table 13.6 is a summary of the stress data collected. In this table we include column 
totals (treatments) and row totals (blocks) as well as some sample means that will be helpful 
in making the sum of squares computations for the ANOVA procedure. Because lower stress 
values are viewed as better, the sample data seem to favor system B with its mean stress rat- 
ing of 13. However, the usual question remains: Do the sample results justify the conclusion 
that the population mean stress levels for the three systems differ? That is, are the differences 
statistically significant? An analysis of variance computation similar to the one performed 
for the completely randomized design can be used to answer this statistical question. 


TABLE 13.5 A Randomized Block Design for the Air Traffic Controller Stress Test

                                   Treatments
                         System A    System B    System C
         Controller 1       15          15          18
         Controller 2       14          14          14
Blocks   Controller 3       10          11          15
         Controller 4       13          12          17
         Controller 5       16          13          16
         Controller 6       13          13          13


TABLE 13.6 Summary of Stress Data for the Air Traffic Controller Stress Test

                                   Treatments                   Row or
                         System A    System B    System C    Block Totals    Block Means
         Controller 1       15          15          18            48         $\bar{x}_{1\cdot}$ = 48/3 = 16.0
         Controller 2       14          14          14            42         $\bar{x}_{2\cdot}$ = 42/3 = 14.0
Blocks   Controller 3       10          11          15            36         $\bar{x}_{3\cdot}$ = 36/3 = 12.0
         Controller 4       13          12          17            42         $\bar{x}_{4\cdot}$ = 42/3 = 14.0
         Controller 5       16          13          16            45         $\bar{x}_{5\cdot}$ = 45/3 = 15.0
         Controller 6       13          13          13            39         $\bar{x}_{6\cdot}$ = 39/3 = 13.0

Column or
Treatment Totals            81          78          93           252         $\bar{\bar{x}}$ = 252/18 = 14.0

Treatment Means     $\bar{x}_{\cdot 1}$ = 81/6 = 13.5    $\bar{x}_{\cdot 2}$ = 78/6 = 13.0    $\bar{x}_{\cdot 3}$ = 93/6 = 15.5


ANOVA Procedure 


The ANOVA procedure for the randomized block design requires us to partition the sum 
of squares total (SST) into three groups: sum of squares due to treatments (SSTR), sum of 
squares due to blocks (SSBL), and sum of squares due to error (SSE). The formula for this 
partitioning follows. 


SST = SSTR + SSBL + SSE (13.21) 


This sum of squares partition is summarized in the ANOVA table for the randomized block 
design as shown in Table 13.7. The notation used in the table is 


k = the number of treatments
b = the number of blocks
$n_T$ = the total sample size ($n_T = kb$)

Note that the ANOVA table also shows how the $n_T - 1$ total degrees of freedom are
partitioned such that k − 1 degrees of freedom go to treatments, b − 1 go to blocks, and
(k − 1)(b − 1) go to the error term. The mean square column shows the sum of squares
divided by the degrees of freedom, and F = MSTR/MSE is the F ratio used to test for a
significant difference among the treatment means. The primary contribution of the randomized
block design is that, by including blocks, we remove the individual controller differences
from the MSE term and obtain a more powerful test for the stress differences in the
three workstation alternatives.


TABLE 13.7 ANOVA Table for the Randomized Block Design with k Treatments and b Blocks

Source          Sum           Degrees            Mean
of Variation    of Squares    of Freedom         Square                          F           p-Value
Treatments      SSTR          k − 1              MSTR = SSTR/(k − 1)             MSTR/MSE
Blocks          SSBL          b − 1              MSBL = SSBL/(b − 1)
Error           SSE           (k − 1)(b − 1)     MSE = SSE/[(k − 1)(b − 1)]
Total           SST           $n_T$ − 1


Computations and Conclusions 


To compute the F statistic needed to test for a difference among treatment means with a 
randomized block design, we need to compute MSTR and MSE. To calculate these two 
mean squares, we must first compute SSTR and SSE; in doing so, we will also compute 
SSBL and SST. To simplify the presentation, we perform the calculations in four steps. In 
addition to k, b, and $n_T$ as previously defined, the following notation is used.


$x_{ij}$ = value of the observation corresponding to treatment j in block i
$\bar{x}_{\cdot j}$ = sample mean of the jth treatment
$\bar{x}_{i\cdot}$ = sample mean for the ith block
$\bar{\bar{x}}$ = overall sample mean


Step 1. Compute the total sum of squares (SST).

$$\text{SST} = \sum_{i=1}^{b}\sum_{j=1}^{k}(x_{ij} - \bar{\bar{x}})^2 \tag{13.22}$$

Step 2. Compute the sum of squares due to treatments (SSTR).

$$\text{SSTR} = b\sum_{j=1}^{k}(\bar{x}_{\cdot j} - \bar{\bar{x}})^2 \tag{13.23}$$

Step 3. Compute the sum of squares due to blocks (SSBL).

$$\text{SSBL} = k\sum_{i=1}^{b}(\bar{x}_{i\cdot} - \bar{\bar{x}})^2 \tag{13.24}$$

Step 4. Compute the sum of squares due to error (SSE).

$$\text{SSE} = \text{SST} - \text{SSTR} - \text{SSBL} \tag{13.25}$$


For the air traffic controller data in Table 13.6, these steps lead to the following sums of 
squares. 


Step 1. SST = $(15 - 14)^2 + (15 - 14)^2 + (18 - 14)^2 + \cdots + (13 - 14)^2 = 70$
Step 2. SSTR = $6[(13.5 - 14)^2 + (13.0 - 14)^2 + (15.5 - 14)^2] = 21$
Step 3. SSBL = $3[(16 - 14)^2 + (14 - 14)^2 + (12 - 14)^2 + (14 - 14)^2 + (15 - 14)^2 + (13 - 14)^2] = 30$
Step 4. SSE = $70 - 21 - 30 = 19$
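
The four steps can be verified with a short script. This is a minimal sketch in plain Python; the variable names are illustrative. The System C entry for Controller 6 is taken as 13, the value implied by that controller's block mean of 13.0 and by the final term of Step 1.

```python
# Rows are blocks (controllers 1-6); columns are treatments (systems A, B, C).
stress = [
    [15, 15, 18],
    [14, 14, 14],
    [10, 11, 15],
    [13, 12, 17],
    [16, 13, 16],
    [13, 13, 13],
]

b = len(stress)      # number of blocks
k = len(stress[0])   # number of treatments

grand_mean = sum(sum(row) for row in stress) / (b * k)
treatment_means = [sum(row[j] for row in stress) / b for j in range(k)]
block_means = [sum(row) / k for row in stress]

# Steps 1-4, equations (13.22) through (13.25).
sst = sum((x - grand_mean) ** 2 for row in stress for x in row)
sstr = b * sum((m - grand_mean) ** 2 for m in treatment_means)
ssbl = k * sum((m - grand_mean) ** 2 for m in block_means)
sse = sst - sstr - ssbl

print(grand_mean, treatment_means, block_means)
print(sst, sstr, ssbl, sse)
```

Computing SSE by subtraction mirrors Step 4; it works because the three components partition SST as in equation (13.21).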


These sums of squares divided by their degrees of freedom provide the corresponding 
mean square values shown in Table 13.8. 


TABLE 13.8 ANOVA Table for the Air Traffic Controller Stress Test

Source          Sum           Degrees       Mean
of Variation    of Squares    of Freedom    Square    F                  p-Value
Treatments      21            2             10.5      10.5/1.9 = 5.53    .024
Blocks          30            5             6.0
Error           19            10            1.9
Total           70            17



Let us use a level of significance α = .05 to conduct the hypothesis test. The value of
the test statistic is

    $F = \frac{MSTR}{MSE} = \frac{10.5}{1.9} = 5.53$


The numerator degrees of freedom is k - 1 = 3 - 1 = 2 and the denominator degrees of
freedom is (k - 1)(b - 1) = (3 - 1)(6 - 1) = 10. Because we will only reject the null
hypothesis for large values of the test statistic, the p-value is the area under the F distribution
to the right of F = 5.53. From Table 4 of Appendix B we find that with the degrees of
freedom 2 and 10, F = 5.53 is between $F_{.025} = 5.46$ and $F_{.01} = 7.56$. As a result, the area
in the upper tail, or the p-value, is between .01 and .025. Alternatively, we can use Excel
to show that the p-value for F = 5.53 is .024. With p-value ≤ α = .05, we reject the null
hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$ and conclude that the population mean stress levels differ for
the three workstation alternatives.
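
The test statistic and its p-value can also be checked with a few lines of Python. The sketch below is not part of the Excel workflow used in this text. It relies on the fact that, when the numerator degrees of freedom equal 2, the upper-tail area of the F distribution has the closed form $(1 + 2F/d_2)^{-d_2/2}$, where $d_2$ is the denominator degrees of freedom. The function name `f_upper_tail_df1_2` is hypothetical, and for other numerator degrees of freedom a statistical library should be used instead.

```python
def f_upper_tail_df1_2(f_value, df2):
    """Upper-tail area of the F distribution; valid ONLY when the numerator df is 2."""
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


k, b = 3, 6                      # treatments and blocks
sstr, sse = 21, 19               # sums of squares from Steps 2 and 4

mstr = sstr / (k - 1)            # mean square due to treatments
mse = sse / ((k - 1) * (b - 1))  # mean square due to error
f_stat = mstr / mse
p_value = f_upper_tail_df1_2(f_stat, (k - 1) * (b - 1))

print(round(f_stat, 2), round(p_value, 3))   # 5.53 0.024
```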

Some general comments can be made about the randomized block design. The 
experimental design described in this section is a complete block design; the word 
“complete” indicates that each block is subjected to all k treatments. That is, all 
controllers (blocks) were tested with all three systems (treatments). Experimental 
designs in which some but not all treatments are applied to each block are referred to 
as incomplete block designs. A discussion of incomplete block designs is beyond the 
scope of this text. 

Because each controller in the air traffic controller stress test was required to use 
all three systems, this approach guarantees a complete block design. In some cases, 
however, blocking is carried out with “similar” experimental units in each block. For 
example, assume that in a pretest of air traffic controllers, the population of controllers 
was divided into groups ranging from extremely high-stress individuals to extremely 
low-stress individuals. The blocking could still be accomplished by having three con- 
trollers from each of the stress classifications participate in the study. Each block would 
then consist of three controllers in the same stress group. The randomized aspect of the 
block design would be the random assignment of the three controllers in each block to 
the three systems. 

Finally, note that the ANOVA table shown in Table 13.7 provides an F value to test for 
treatment effects but not for blocks. The reason is that the experiment was designed to test 
a single factor—workstation design. The blocking based on individual stress differences 
was conducted to remove such variation from the MSE term. However, the study was not 
designed to test specifically for individual differences in stress. 

Some analysts compute F = MSB/MSE and use that statistic to test for significance of 
the blocks. Then they use the result as a guide to whether the same type of blocking would 
be desired in future experiments. However, if individual stress difference is to be a factor in 
the study, a different experimental design should be used. A test of significance on blocks 
should not be performed as a basis for a conclusion about a second factor. 


Excel’s Anova: Two-Factor Without Replication tool can be used to test whether the mean 
stress levels for air traffic controllers are the same for the three systems. 




Enter/Access Data: Open the file AirTraffic. The data are in cells B2:D7 and labels are 
in column A and cells B1:D1. 


Apply Tools: The following steps describe how to use Excel’s Anova: Two-Factor Without 
Replication tool to test the hypothesis that the mean stress level score is the same for all 
three systems. 


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. Choose Anova: Two-Factor Without Replication from the list of Analysis
        Tools
Step 4. When the Anova: Two-Factor Without Replication dialog box appears (see
        Figure 13.7):
        Enter A1:D7 in the Input Range: box
        Select the check box for Labels
        Enter .05 in the Alpha: box
        Select Output Range: in the Output options area
        Enter A9 in the Output Range: box (to identify the upper left corner of the
        section of the worksheet where the output will appear)
        Click OK


The output, titled Anova: Two-Factor Without Replication, appears in cells A9:G30 of 
the worksheet shown in Figure 13.8. Cells A11:E21 provide a summary of the data. The
ANOVA table shown in cells A24:G30 is basically the same as the ANOVA table shown in 
Table 13.8. The label Rows corresponds to the blocks in the problem, and the label Columns 
corresponds to the treatments. The Excel output provides the p-value associated with the test 
as well as the critical F value. 

We can use the p-value shown in cell F27, 0.0242, to make the hypothesis testing
decision. Thus, at the α = .05 level of significance, we reject $H_0$ because the p-value =
.0242 < α = .05. Hence, we conclude that the mean stress scores differ among the
three systems.


FIGURE 13.7 Excel's ANOVA: Two-Factor without Replication Tool Dialog Box for the
Air Traffic Controller Stress Test

FIGURE 13.8 Excel's ANOVA: Two-Factor without Replication Tool Output
for the Air Traffic Controller Stress Test

NOTE + COMMENT


The error degrees of freedom are less for a randomized block design than for a completely
randomized design because b - 1 degrees of freedom are lost for the b blocks. If n is
small, the potential effects due to blocks can be masked because of the loss of error
degrees of freedom; for large n, the effects are minimized.
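
The loss of b - 1 error degrees of freedom can be verified directly. The short derivation below assumes both designs use the same $n_T = kb$ observations; it is a check on the note above, not an additional result.

```latex
\begin{align*}
\text{Completely randomized design:}\quad & n_T - k = kb - k = k(b - 1) \\
\text{Randomized block design:}\quad & (k - 1)(b - 1) \\
\text{Difference:}\quad & k(b - 1) - (k - 1)(b - 1) = b - 1
\end{align*}
```

For the air traffic controller study, k = 3 and b = 6, so the error degrees of freedom fall from 15 to 10, a loss of 6 - 1 = 5.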


EXERCISES


Methods 
21. Consider the experimental results for the following randomized block design. Make 
the calculations necessary to set up the analysis of variance table.

Treatments
A B C 
1 10 9 8 
2 12 6 
Blocks 3 18 15 14 
4 20 18 18 
5 8 7 


Use α = .05 to test for any significant differences.

22. The following data were obtained for a randomized block design involving five treatments
and three blocks: SST = 430, SSTR = 310, SSBL = 85. Set up the ANOVA
table and test for any significant differences. Use α = .05.


23. An experiment has been conducted for four treatments with eight blocks. Complete the
following analysis of variance table.

Source          Sum           Degrees        Mean
of Variation    of Squares    of Freedom     Square    F
Treatments       900
Blocks           400
Error
Total           1800

Use α = .05 to test for any significant differences.


Applications 

24. Auto Tune-Ups. An automobile dealer conducted a test to determine if the time in 
minutes needed to complete a minor engine tune-up depends on whether a comput- 
erized engine analyzer or an electronic analyzer is used. Because tune-up time varies 
among compact, intermediate, and full-sized cars, the three types of cars were used as 
blocks in the experiment. The data obtained follow. 


                              Analyzer
                       Computerized    Electronic
      Compact               50             42
Car   Intermediate          55             44
      Full-sized            63             46

Use α = .05 to test for any significant differences.

25. Airfares on Travel Websites. Are there differences in airfare depending on which 
travel agency website you utilize? The following data were collected on travel agency 
websites on July 9, 2018. The following table contains the prices in U.S. dollars for a 
one-way ticket between the cities listed on the left for each of the three travel agency 
websites. Here the pairs of cities are the blocks and the treatments are the different 
websites. Use α = .05 to test for any significant differences in the mean price of a
one-way airline ticket for the three travel agency websites.

                                            Website
Flight From-To               Expedia ($)    TripAdvisor ($)    Priceline ($)
Atlanta to Seattle             176.00           166.00            175.80
New York to Los Angeles        195.00           195.00            206.20
Cleveland to Orlando            77.00            72.00             76.21
Dallas to Indianapolis         149.00           149.00            148.20

DATA file: Airfares

26. SAT Performance. The Scholastic Aptitude Test (SAT) contains three parts: critical 
reading, mathematics, and writing. Each part is scored on an 800-point scale. Informa- 
tion on test scores for the 2009 version of the SAT is available at the College Board 
website. A sample of SAT scores for six students follows. 


            Critical
Student     Reading     Mathematics     Writing
   1          526           534           530
   2          594           590           586
   3          465           464           445
   4          561           566           553
   5          436           478           430
   6          430           458           420

DATA file: SATScores

a. Using a .05 level of significance, do students perform differently on the three por- 
tions of the SAT? 
b. Which portion of the test seems to give the students the most trouble? Explain. 
27. Consumer Preferences. In 2018, consumer goods giant Procter and Gamble (P&G) 
had more than 20 brands with more than $1 billion in annual sales (P&G website, 
https://us.pg.com/). How does a company like P&G create so many successful con- 
sumer products? P&G effectively invests in research and development to understand 
what consumers want. One method used to determine consumer preferences is called 
conjoint analysis. Conjoint analysis allows a company to ascertain the utility that a 
respondent in the conjoint study places on a design of a given product. The higher the 
utility, the more valuable a respondent finds the design. Suppose we have conducted a 
conjoint study and have the following estimated utilities (higher is preferred) for each 
of three different designs for a new whitening toothpaste. 
Utilities 
Respondent Design A Design B Design C 
1 24.6 34.6 28.6 
= 1 a a 
= DATA file 4 15.4 26.6 24.9 
Toothpaste 5 20.7 18.5 18.0 
6 41.0 34.2 44.6 
7 27a 227 2T 
8 20.2 22.0 27k 
9 31.6 226 Sill 
10 24.4 292 Zoe 


At the .05 level of significance, test for any significant differences.


13.5 Factorial Experiment


The experimental designs we have considered thus far enable us to draw statistical con- 
clusions about one factor. However, in some experiments we want to draw conclusions 
about more than one variable or factor. A factorial experiment is an experimental design 
that allows simultaneous conclusions about two or more factors. The term factorial is 
used because the experimental conditions include all possible combinations of the factors. 
For example, for a levels of factor A and b levels of factor B, the experiment will involve 
collecting data on ab treatment combinations. In this section, we will show the analysis 
for a two-factor factorial experiment. The basic approach can be extended to experiments 
involving more than two factors. 

As an illustration of a two-factor factorial experiment, we will consider a study in- 
volving the Graduate Management Admissions Test (GMAT), a standardized test used by 
graduate schools of business to evaluate an applicant’s ability to pursue a graduate program 
in that field. Scores on the GMAT range from 200 to 800, with higher scores implying 
higher aptitude. 

In an attempt to improve students’ performance on the GMAT, a major Texas university 
is considering offering the following three GMAT preparation programs. 


1. A three-hour review session covering the types of questions generally asked on the 
GMAT. 

2. A one-day program covering relevant exam material, along with the taking and 
grading of a sample exam. 

3. An intensive 10-week course involving the identification of each student’s weak- 
nesses and the setting up of individualized programs for improvement. 


Hence, one factor in this study is the GMAT preparation program, which has three treat- 
ments: three-hour review, one-day program, and 10-week course. Before selecting the 
preparation program to adopt, further study will be conducted to determine how the pro- 
posed programs affect GMAT scores. 

The GMAT is usually taken by students from three colleges: the College of Business, 
the College of Engineering, and the College of Arts and Sciences. Therefore, a second 
factor of interest in the experiment is whether a student’s undergraduate college affects the 
GMAT score. This second factor, undergraduate college, also has three treatments: busi- 
ness, engineering, and arts and sciences. The factorial design for this experiment with three 
treatments corresponding to factor A, the preparation program, and three treatments corre- 
sponding to factor B, the undergraduate college, will have a total of 3 X 3 = 9 treatment 
combinations. These treatment combinations or experimental conditions are summarized 
in Table 13.9. 

Assume that a sample of two students will be selected corresponding to each of the nine
treatment combinations shown in Table 13.9: Two business students will take the three-hour
review, two will take the one-day program, and two will take the 10-week course. In
addition, two engineering students and two arts and sciences students will take each of the
three preparation programs. In experimental design terminology, the sample size of two for
each treatment combination indicates that we have two replications. Additional replications
and a larger sample size could easily be used, but we elect to minimize the computational
aspects for this illustration.


TABLE 13.9 Nine Treatment Combinations for the Two-Factor GMAT Experiment

                                              Factor B: College
                                    Business    Engineering    Arts and Sciences
Factor A:      Three-hour review       1             2                 3
Preparation    One-day program         4             5                 6
Program        10-week course          7             8                 9


TABLE 13.10 GMAT Scores for the Two-Factor Experiment

                                              Factor B: College
                                    Business    Engineering    Arts and Sciences
Factor A:      Three-hour review      500           540               480
Preparation                           580           460               400
Program        One-day program        460           560               420
                                      540           620               480
               10-week course         560           600               480
                                      600           580               410

DATA file: GMATStudy

This experimental design requires that 6 students who plan to attend graduate school 
be randomly selected from each of the three undergraduate colleges. Then 2 students from 
each college should be assigned randomly to each preparation program, resulting in a total 
of 18 students being used in the study. 
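
The random assignment described above can be sketched in Python as follows. This is an illustration only: the student labels are hypothetical placeholders, the function name `assign_students` is an assumption, and the fixed seed is used solely so that the sketch gives the same assignment each time it is run.

```python
import random

colleges = ["Business", "Engineering", "Arts and Sciences"]
programs = ["Three-hour review", "One-day program", "10-week course"]


def assign_students(students_by_college, programs, replications, rng):
    """Randomly split each college's students into groups of `replications`, one group per program."""
    assignment = {}
    for college, students in students_by_college.items():
        needed = len(programs) * replications
        if len(students) != needed:
            raise ValueError(f"{college}: expected {needed} students, got {len(students)}")
        shuffled = list(students)
        rng.shuffle(shuffled)                      # random order within the college
        for index, program in enumerate(programs):
            start = index * replications
            assignment[(program, college)] = shuffled[start:start + replications]
    return assignment


rng = random.Random(13)  # fixed seed, for repeatability of the sketch only

# Hypothetical labels for the 6 randomly selected students from each college
students_by_college = {
    college: [f"{college} student {n}" for n in range(1, 7)] for college in colleges
}

assignment = assign_students(students_by_college, programs, 2, rng)
for (program, college), students in assignment.items():
    print(f"{program} / {college}: {students}")
```

Each of the 9 treatment combinations receives 2 students, for a total of 18 students in the study.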

Let us assume that the randomly selected students participated in the preparation pro- 
grams and then took the GMAT. The scores obtained are reported in Table 13.10. 

The analysis of variance computations with the data in Table 13.10 will provide answers 
to the following questions. 


• Main effect (factor A): Do the preparation programs differ in terms of effect on
  GMAT scores?

• Main effect (factor B): Do the undergraduate colleges differ in terms of effect on
  GMAT scores?

• Interaction effect (factors A and B): Do students in some colleges do better on one
  type of preparation program whereas others do better on a different type of preparation
  program?


The term interaction refers to a new effect that we can now study because we used a 
factorial experiment. If the interaction effect has a significant impact on the GMAT scores, 
we can conclude that the effect of the type of preparation program depends on the under- 
graduate college. 


ANOVA Procedure 


The ANOVA procedure for the two-factor factorial experiment requires us to partition the 
sum of squares total (SST) into four groups: sum of squares for factor A (SSA), sum of 
squares for factor B (SSB), sum of squares for interaction (SSAB), and sum of squares due 
to error (SSE). The formula for this partitioning follows. 


SST = SSA + SSB + SSAB + SSE (13.26) 


The partitioning of the sum of squares and degrees of freedom is summarized in Table 13.11. 
The following notation is used. 


a = number of levels of factor A 
b = number of levels of factor B 
r = number of replications 
$n_T$ = total number of observations taken in the experiment; $n_T = abr$


TABLE 13.11 ANOVA Table for the Two-Factor Factorial Experiment With r Replications

Source          Sum           Degrees           Mean
of Variation    of Squares    of Freedom        Square                          F           p-Value
Factor A        SSA           a - 1             MSA = SSA/(a - 1)               MSA/MSE
Factor B        SSB           b - 1             MSB = SSB/(b - 1)               MSB/MSE
Interaction     SSAB          (a - 1)(b - 1)    MSAB = SSAB/[(a - 1)(b - 1)]    MSAB/MSE
Error           SSE           ab(r - 1)         MSE = SSE/[ab(r - 1)]
Total           SST           n_T - 1
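
The degrees of freedom column partitions in the same way as the sums of squares: the four component degrees of freedom add up to $n_T - 1 = abr - 1$. A minimal Python sketch of this check follows; the function name `factorial_df` is hypothetical, and the values a = 3, b = 3, and r = 2 are those of the GMAT experiment.

```python
def factorial_df(a, b, r):
    """Degrees of freedom for a two-factor factorial experiment with r replications."""
    df = {
        "Factor A": a - 1,
        "Factor B": b - 1,
        "Interaction": (a - 1) * (b - 1),
        "Error": a * b * (r - 1),
    }
    total = a * b * r - 1
    # The component degrees of freedom must sum to the total, n_T - 1.
    if sum(df.values()) != total:
        raise ValueError("degrees of freedom do not partition the total")
    df["Total"] = total
    return df


gmat_df = factorial_df(a=3, b=3, r=2)
print(gmat_df)
# {'Factor A': 2, 'Factor B': 2, 'Interaction': 4, 'Error': 9, 'Total': 17}
```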


Computations and Conclusions 


To compute the F statistics needed to test for the significance of factor A, factor B, and 
interaction, we need to compute MSA, MSB, MSAB, and MSE. To calculate these four 
mean squares, we must first compute SSA, SSB, SSAB, and SSE; in doing so we will also 
compute SST. To simplify the presentation, we perform the calculations in five steps. In 
addition to a, b, r, and $n_T$ as previously defined, the following notation is used.


$x_{ijk}$ = observation corresponding to the kth replicate taken from treatment i
          of factor A and treatment j of factor B
$\bar{x}_{i \cdot}$ = sample mean for the observations in treatment i (factor A)
$\bar{x}_{\cdot j}$ = sample mean for the observations in treatment j (factor B)
$\bar{x}_{ij}$ = sample mean for the observations corresponding to the combination
          of treatment i (factor A) and treatment j (factor B)
$\bar{\bar{x}}$ = overall sample mean of all $n_T$ observations


Step 1. Compute the total sum of squares.
    $SST = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{r} (x_{ijk} - \bar{\bar{x}})^2$    (13.27)
Step 2. Compute the sum of squares for factor A.
    $SSA = br \sum_{i=1}^{a} (\bar{x}_{i \cdot} - \bar{\bar{x}})^2$    (13.28)
Step 3. Compute the sum of squares for factor B.
    $SSB = ar \sum_{j=1}^{b} (\bar{x}_{\cdot j} - \bar{\bar{x}})^2$    (13.29)
Step 4. Compute the sum of squares for interaction.
    $SSAB = r \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{x}_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{\bar{x}})^2$    (13.30)
Step 5. Compute the sum of squares due to error.
    $SSE = SST - SSA - SSB - SSAB$    (13.31)

Excel's F.DIST.RT function makes it easy to compute the p-value for each F value. For
example, for factor A, p-value = F.DIST.RT(1.3829,2,9) = .2994.


Table 13.12 reports the data collected in the experiment and the various sums that 
will help us with the sum of squares computations. Using equations (13.27) through 
(13.31), we calculate the following sums of squares for the GMAT two-factor factorial 
experiment. 


Step 1. SST = (500 - 515)^2 + (580 - 515)^2 + (540 - 515)^2 + ...
        + (410 - 515)^2 = 82,450
Step 2. SSA = (3)(2)[(493.33 - 515)^2 + (513.33 - 515)^2
        + (538.33 - 515)^2] = 6100
Step 3. SSB = (3)(2)[(540 - 515)^2 + (560 - 515)^2 + (445 - 515)^2] = 45,300
Step 4. SSAB = 2[(540 - 493.33 - 540 + 515)^2 + (500 - 493.33
        - 560 + 515)^2 + ... + (445 - 538.33 - 445 + 515)^2] = 11,200
Step 5. SSE = 82,450 - 6100 - 45,300 - 11,200 = 19,850
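
The five steps can also be carried out in Python as a check on the hand calculations. The sketch below uses the GMAT scores from the worksheet in the GMATStudy file (cells B2:D7); the nested-list layout and the names `gmat`, `mean`, and `two_factor_ss` are choices made for this illustration only.

```python
# gmat[i][j] holds the r = 2 scores for preparation program i (factor A)
# and undergraduate college j (factor B: business, engineering, arts and sciences).
gmat = [
    [[500, 580], [540, 460], [480, 400]],   # three-hour review
    [[460, 540], [560, 620], [420, 480]],   # one-day program
    [[560, 600], [600, 580], [480, 410]],   # 10-week course
]


def mean(values):
    return sum(values) / len(values)


def two_factor_ss(data):
    """Return SST, SSA, SSB, SSAB, and SSE for a two-factor experiment with replications."""
    a, b, r = len(data), len(data[0]), len(data[0][0])
    all_obs = [x for row in data for cell in row for x in cell]
    grand = mean(all_obs)
    a_means = [mean([x for cell in row for x in cell]) for row in data]
    b_means = [mean([x for row in data for x in row[j]]) for j in range(b)]
    cell_means = [[mean(cell) for cell in row] for row in data]

    sst = sum((x - grand) ** 2 for x in all_obs)                       # (13.27)
    ssa = b * r * sum((m - grand) ** 2 for m in a_means)               # (13.28)
    ssb = a * r * sum((m - grand) ** 2 for m in b_means)               # (13.29)
    ssab = r * sum(
        (cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )                                                                  # (13.30)
    sse = sst - ssa - ssb - ssab                                       # (13.31)
    return {"SST": sst, "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse}


ss = two_factor_ss(gmat)
for name, value in ss.items():
    print(f"{name} = {value:,.0f}")
```

Apart from floating-point rounding, the values agree with the results of Steps 1 through 5.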


These sums of squares divided by their corresponding degrees of freedom provide the 
following mean square values for testing the two main effects (preparation program and 
undergraduate college) and interaction effect. 


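Factor A:       $MSA = \frac{SSA}{a - 1} = \frac{6100}{3 - 1} = 3050$

Factor B:       $MSB = \frac{SSB}{b - 1} = \frac{45{,}300}{3 - 1} = 22{,}650$

Interaction:    $MSAB = \frac{SSAB}{(a - 1)(b - 1)} = \frac{11{,}200}{(3 - 1)(3 - 1)} = 2800$

Error:          $MSE = \frac{SSE}{ab(r - 1)} = \frac{19{,}850}{(3)(3)(2 - 1)} = 2205.5556$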


Let us use a level of significance of α = .05 to conduct the hypothesis tests for the
two-factor GMAT study. The F ratio used to test for differences among preparation programs
(factor A) is F = MSA/MSE = 3050/2205.5556 = 1.3829. We can use Excel to show that
the p-value corresponding to F = 1.3829 is .2994. Because the p-value > α = .05, we
cannot reject the null hypothesis and must conclude that there is no significant difference
among the three preparation programs. However, for the undergraduate college (factor B) 
effect, the p-value corresponding to F = MSB/MSE = 22,650/2205.5556 = 10.2695 is 
.0048. Hence, the analysis of variance results enable us to conclude that the GMAT test 
scores do differ among the three undergraduate colleges; that is, the three undergraduate 
colleges do not provide the same preparation for performance on the GMAT. Finally, the 
interaction F value of F = MSAB/MSE = 2800/2205.5556 = 1.2695 and its correspond- 
ing p-value of .3503 mean we cannot identify a significant interaction effect. Therefore, 
we have no reason to believe that the three preparation programs differ in their ability to 
prepare students from the different colleges for the GMAT. The ANOVA table shown in 
Table 13.13 provides a summary of these results for the two-factor GMAT study. 
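
As a cross-check on the Excel results, the three F ratios and their p-values can be computed in Python. The sketch below avoids any statistical library by using a closed form for the upper-tail area of the F distribution that holds only when the numerator degrees of freedom are even, which is the case here (2 and 4). The function name `f_upper_tail_even_df1` is hypothetical; for odd numerator degrees of freedom a statistical package would be needed.

```python
def f_upper_tail_even_df1(f_value, df1, df2):
    """Upper-tail area of the F distribution; valid ONLY for even numerator df."""
    if df1 <= 0 or df1 % 2 != 0:
        raise ValueError("this closed form requires an even, positive numerator df")
    half_df2 = df2 / 2
    x = df2 / (df2 + df1 * f_value)
    term, total = 1.0, 1.0
    # Finite series for the regularized incomplete beta function I_x(df2/2, df1/2)
    for j in range(1, df1 // 2):
        term *= (half_df2 + j - 1) / j * (1 - x)
        total += term
    return x ** half_df2 * total


a, b, r = 3, 3, 2
sums_of_squares = {"Factor A": 6100, "Factor B": 45300, "Interaction": 11200}
degrees = {"Factor A": a - 1, "Factor B": b - 1, "Interaction": (a - 1) * (b - 1)}
sse, df_error = 19850, a * b * (r - 1)
mse = sse / df_error

results = {}
for source, ss_value in sums_of_squares.items():
    mean_square = ss_value / degrees[source]
    f_stat = mean_square / mse
    p_value = f_upper_tail_even_df1(f_stat, degrees[source], df_error)
    results[source] = (f_stat, p_value)
    print(f"{source}: F = {f_stat:.4f}, p-value = {p_value:.4f}")
```

The printed F values and p-values should agree, to four decimal places, with those reported in the discussion above.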

Undergraduate college was found to be a significant factor. Checking the calculations
in Table 13.12, we see that the sample means are business students $\bar{x}_{\cdot 1}$ = 540,
engineering students $\bar{x}_{\cdot 2}$ = 560, and arts and sciences students $\bar{x}_{\cdot 3}$ = 445.
Tests on individual treatment means can be conducted; yet after reviewing the three sample
means, we would anticipate no difference in preparation for business and engineering graduates.
However, the arts and sciences students appear to be significantly less prepared for the 
GMAT than students in the other colleges. Perhaps this observation will lead the uni- 
versity to consider other options for assisting these students in preparing for graduate 
management admission tests.


TABLE 13.13 ANOVA Table for the Two-Factor GMAT Study

Source of      Sum of     Degrees
Variation      Squares    of Freedom    Mean Square    F                               p-Value
Factor A        6,100      2             3,050         3,050/2,205.5556 = 1.3829       .2994
Factor B       45,300      2            22,650         22,650/2,205.5556 = 10.2695     .0048
Interaction    11,200      4             2,800         2,800/2,205.5556 = 1.2695       .3503
Error          19,850      9             2,205.5556
Total          82,450     17


Excel’s Anova: Two-Factor With Replication tool can be used to analyze the data for the
two-factor GMAT experiment. Refer to Figures 13.9 and 13.10 as we describe the tasks
involved.


Enter/Access Data: Open the file GMATStudy. The data are in cells B2:D7 and labels 
are in cells A2:A7 and A1:D1. 


FIGURE 13.9 Excel's ANOVA: Two-Factor with Replication Tool Dialog Box
for the GMAT Experiment

Worksheet data (cells A1:D7):

Preparation        Business    Engineering    Arts and Sciences
3-hour review        500          540               480
3-hour review        580          460               400
1-day program        460          560               420
1-day program        540          620               480
10-week course       560          600               480
10-week course       600          580               410

Dialog box settings: Input Range: $A$1:$D$7; Rows per sample: 2; Alpha: 0.05;
Output Range: $A$9


FIGURE 13.10 Excel's ANOVA: Two-Factor with Replication Tool Output for the
GMAT Experiment

Apply Tools: The following steps describe how Excel’s Anova: Two-Factor With Replica- 
tion tool can be used to analyze the data for the two-factor GMAT experiment. 


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. Choose Anova: Two-Factor With Replication from the list of Analysis Tools
Step 4. When the Anova: Two-Factor With Replication dialog box appears (see Figure 13.9):
        Enter A1:D7 in the Input Range: box
        Enter 2 in the Rows per sample: box
        Enter .05 in the Alpha: box
        Select Output Range: in the Output options area
        Enter A9 in the Output Range: box (to identify the upper left corner of
        the section of the worksheet where the output will appear)
        Click OK


The output, titled Anova: Two-Factor With Replication, appears in cells A9:G44 of 
the worksheet shown in Figure 13.10. Cells A11:E34 provide a summary of the data. The
ANOVA table, shown in cells A37:G44, is basically the same as the ANOVA table shown 
in Table 13.13. The label Sample corresponds to factor A, the label Columns corresponds 
to factor B, and the label Within corresponds to error. The Excel output provides the 
p-value associated with each F test as well as the critical F values. 

We can use the p-value of .2994 in cell F39 to make the hypothesis testing decision
for factor A (Preparation Program). At the α = .05 level of significance, we cannot reject
$H_0$ because the p-value = .2994 > α = .05. We can make the hypothesis testing decision
for factor B (College) by using the p-value of 0.0048 shown in cell F40. At the α = .05
level of significance, we reject $H_0$ because the p-value = .0048 < α = .05. Finally, the
interaction p-value of 0.3503 in cell F41 means that we cannot identify a significant
interaction effect.


EXERCISES

Methods 

28. A factorial experiment involving two levels of factor A and three levels of factor B 
resulted in the following data. 


                           Factor B
               Level 1     Level 2     Level 3
Factor A
   Level 1       135          90          75
                 165          66          93
   Level 2        25         127         120
                  95         105         136

Test for any significant main effects and any interaction. Use α = .05.

29. The calculations for a factorial experiment involving four levels of factor A, three 
levels of factor B, and three replications resulted in the following data: SST = 280, 
SSA = 26, SSB = 23, SSAB = 175. Set up the ANOVA table and test for any significant
main effects and any interaction effect. Use α = .05.


Applications 

30. Mobile App Website Design. Based on a 2018 study, the average elapsed time 
between when a user navigates to a website on a mobile device until its main 
content is available was 14.6 seconds. This is more than a 20% increase from 2017 
(searchenginejournal.com, https://www.searchenginejournal.com/). Responsiveness 
is certainly an important feature of any website and is perhaps even more impor- 
tant on a mobile device. What other web design factors need to be considered for a 
mobile device to make it more user friendly? Among other things, navigation menu 
placement and amount of text entry required are important on a mobile device. The 
following data provide the time it took (in seconds) randomly selected students 
(two for each factor combination) to perform a prespecified task with the different 
combinations of navigation menu placement and amount of text entry required.

                                     Amount of Text Entry Required
                                          Low         High
Navigation Menu Position    Right           8           12
                                           12            8
                            Middle         22           36
                                           14           20
                            Left           10           18
                                           18           14

DATA file: MobileApps

Use the ANOVA procedure for factorial designs to test for any significant effects resulting
from navigation menu position and amount of text entry required. Use α = .05.

31. Amusement Park Queues. An amusement park studied methods for decreasing the 
waiting time (minutes) for rides by loading and unloading riders more efficiently. Two 
alternative loading/unloading methods have been proposed. To account for potential 
differences due to the type of ride and the possible interaction between the method of 
loading and unloading and the type of ride, a factorial experiment was designed. Use 
the following data to test for any significant effect due to the loading and unloading 
method, the type of ride, and interaction. Use α = .05.
Type of Ride 
Roller Coaster Screaming Demon Log Flume 
41 52 50 
Method 1 43 a4 16 
Method 2 49 50 48 
stnan 51 46 44 
32. Auto Fuel Efficiency. As part of a study designed to compare hybrid and similarly 
equipped conventional vehicles, Consumer Reports tested a variety of classes of 
hybrid and all-gas model cars and sport utility vehicles (SUVs). The following data 
show the miles-per-gallon rating Consumer Reports obtained for two hybrid small 
cars, two hybrid midsize cars, two hybrid small SUVs, and two hybrid midsize 
SUVs; also shown are the miles per gallon obtained for eight similarly equipped 
conventional models. 
Make/Model Class Type MPG 
Honda Civic Small Car Hybrid er 
Honda Civic Small Car Conventional 28 
Toyota Prius Small Car Hybrid 44 
d ; Toyota Corolla Small Car Conventional 32 
= DATA file Chevrolet Malibu Midsize Car Hybrid 27 
HybridTest Chevrolet Malibu Midsize Car Conventional 23 
Nissan Altima Midsize Car Hybrid 32 
Nissan Altima Midsize Car Conventional 25 
Ford Escape Small SUV Hybrid 2 
Ford Escape Small SUV Conventional 21 
Saturn Vue Small SUV Hybrid 28 
Saturn Vue Small SUV Conventional 22 
Lexus RX Midsize SUV Hybrid 23 
Lexus RX Midsize SUV Conventional 1 
Toyota Highlander Midsize SUV Hybrid 24 
Toyota Highlander Midsize SUV Conventional 18

At the α = .05 level of significance, test for significant effects due to class, type, and
interaction.

33. Tax Research. A study reported in The Accounting Review examined the separate 
and joint effects of two levels of time pressure (low and moderate) and three levels of 
knowledge (naive, declarative, and procedural) on key word selection behavior in tax 
research. Subjects were given a tax case containing a set of facts, a tax issue, and a key 
word index consisting of 1336 key words. They were asked to select the key words 
they believed would refer them to a tax authority relevant to resolving the tax case. 
Prior to the experiment, a group of tax experts determined that the text contained 19 
relevant key words. Subjects in the naive group had little or no declarative or proce- 
dural knowledge, subjects in the declarative group had significant declarative knowl- 
edge but little or no procedural knowledge, and subjects in the procedural group had 
significant declarative knowledge and procedural knowledge. Declarative knowledge 
consists of knowledge of both the applicable tax rules and the technical terms used to 
describe such rules. Procedural knowledge is knowledge of the rules that guide the tax 
researcher’s search for relevant key words. Subjects in the low time pressure situation 
were told they had 25 minutes to complete the problem, an amount of time which 
should be “more than adequate” to complete the case; subjects in the moderate time 
pressure situation were told they would have “only” 11 minutes to complete the case. 
Suppose 25 subjects were selected for each of the six treatment combinations and the 
sample means for each treatment combination are as follows (standard deviations are 
in parentheses). 


                                     Knowledge
                        Naive      Declarative    Procedural
Time Pressure
   Low                   1.13         1.56           2.00
                        (1.12)       (1.33)         (1.54)
   Moderate              0.48         1.68           2.86
                        (0.80)       (1.36)         (1.80)


Use the ANOVA procedure to test for any significant differences due to time pressure, 
knowledge, and interaction. Use a .05 level of significance. Assume that the total sum 
of squares for this experiment is 327.50. 


SUMMARY 


In this chapter we showed how analysis of variance can be used to test for differences 
among means of several populations or treatments. We introduced the completely ran- 
domized design, the randomized block design, and the two-factor factorial experiment. 
The completely randomized design and the randomized block design are used to draw 
conclusions about differences in the means of a single factor. The primary purpose of 
blocking in the randomized block design is to remove extraneous sources of variation 
from the error term. Such blocking provides a better estimate of the true error variance 
and a better test to determine whether the population or treatment means of the factor 
differ significantly. 

We showed that the basis for the statistical tests used in analysis of variance and
experimental design is the development of two independent estimates of the population
variance $\sigma^2$. In the single-factor case, one estimator is based on the variation
between the treatments; this estimator provides an unbiased estimate of $\sigma^2$ only if the
means $\mu_1, \mu_2, \ldots, \mu_k$ are all equal. A second estimator of $\sigma^2$ is based on the variation
of the observations within each sample; this estimator will always provide an unbiased
estimate of $\sigma^2$. By computing the ratio of these two estimators (the F statistic) we
developed a rejection rule for determining whether to reject the null hypothesis that 
the population or treatment means are equal. In all the experimental designs consid- 
ered, the partitioning of the sum of squares and degrees of freedom into their various 
sources enabled us to compute the appropriate values for the analysis of variance cal- 
culations and tests. We also showed how Fisher’s LSD procedure and the Bonferroni 
adjustment can be used to perform pairwise comparisons to determine which means 
are different. 
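
As an illustration of the F statistic described above, the following Python sketch computes MSTR, MSE, and F for a completely randomized design. The data values and the function name one_way_anova are hypothetical, chosen only to keep the arithmetic easy to follow; they are not taken from the text.

```python
from statistics import mean, variance


def one_way_anova(samples):
    """Return (MSTR, MSE, F) for a list of treatment samples.

    Hypothetical helper for illustration; each inner list holds the
    observations for one treatment.
    """
    k = len(samples)                          # number of treatments
    n_total = sum(len(s) for s in samples)    # total number of observations
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Between-treatments variation: sum of n_j * (mean_j - grand mean)^2
    sstr = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-treatments variation: sum of (n_j - 1) * s_j^2
    sse = sum((len(s) - 1) * variance(s) for s in samples)

    mstr = sstr / (k - 1)
    mse = sse / (n_total - k)
    return mstr, mse, mstr / mse


# Made-up observations for three treatments (not from the text)
samples = [[10, 12, 14], [15, 17, 19], [20, 22, 24]]
mstr, mse, f_stat = one_way_anova(samples)
print(f"MSTR = {mstr:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```

A large F value, as in this made-up case, indicates that the between-treatments estimate is much larger than the within-treatments estimate; the decision to reject still requires comparing F with the critical value (or using the p-value) for k − 1 and n_T − k degrees of freedom.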


GLOSSARY 


ANOVA table A table used to summarize the analysis of variance computations and re- 
sults. It contains columns showing the source of variation, the sum of squares, the degrees 
of freedom, the mean square, the F value(s), and the p-value(s). 

Blocking The process of using the same or similar experimental units for all treatments. 
The purpose of blocking is to remove a source of variation from the error term and hence 
provide a more powerful test for a difference in population or treatment means. 

Comparisonwise Type I error rate The probability of a Type I error associated with a 
single pairwise comparison. 

Completely randomized design An experimental design in which the treatments are 
randomly assigned to the experimental units. 

Experimental statistical study A study in which the investigator controls the values of 
one or more variables believed to be related to the outcome of interest, and then mea- 
sures and records the outcome. The investigator’s control over the values of variables 
believed to be related to the outcome of interest allows for possible conclusions about 
whether any of the manipulated variables might have a cause-and-effect relationship 
with the outcome. 

Experimental units The objects of interest in the experiment. 

Experimentwise Type I error rate The probability of making a Type I error on at least 
one of several pairwise comparisons. 

Factor Another word for the independent variable of interest. 

Factorial experiment An experimental design that allows simultaneous conclusions about 
two or more factors. 

Interaction The effect produced when the levels of one factor interact with the levels of 
another factor in influencing the response variable. 

Multiple comparison procedures Statistical procedures that can be used to conduct 
statistical comparisons between pairs of population means. 

Observational study A study in which the investigator observes the outcome of interest 
and possibly values of one or more variables believed to be related to the outcome with- 
out controlling the values of any variables, and then measures and records the outcome. 
The investigator’s lack of control over the values of variables believed to be related to the 
outcome of interest allows only for possible conclusions about associations between the 
outcome and the variables. 

Partitioning The process of allocating the total sum of squares and degrees of freedom to 
the various components. 

Randomized block design An experimental design employing blocking. 

Replications The number of times each experimental condition is repeated in an 
experiment. 

Response variable Another word for the dependent variable of interest. 

Single-factor experiment An experiment involving only one factor with k populations or 
treatments. 

Treatments Different levels of a factor. 


KEY FORMULAS 


Completely Randomized Design 

Sample Mean for Treatment j 
$\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{ij}}{n_j}$ (13.1) 

Sample Variance for Treatment j 
$s_j^2 = \frac{\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{n_j - 1}$ (13.2) 

Overall Sample Mean 
$\bar{\bar{x}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} x_{ij}}{n_T}$ (13.3) 
$n_T = n_1 + n_2 + \cdots + n_k$ (13.4) 

Mean Square Due to Treatments 
$\text{MSTR} = \frac{\text{SSTR}}{k - 1}$ (13.7) 

Sum of Squares Due to Treatments 
$\text{SSTR} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2$ (13.8) 

Mean Square Due to Error 
$\text{MSE} = \frac{\text{SSE}}{n_T - k}$ (13.10) 

Sum of Squares Due to Error 
$\text{SSE} = \sum_{j=1}^{k} (n_j - 1) s_j^2$ (13.11) 

Test Statistic for the Equality of k Population Means 
$F = \frac{\text{MSTR}}{\text{MSE}}$ (13.12) 

Total Sum of Squares 
$\text{SST} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{\bar{x}})^2$ (13.13) 

Partitioning of Sum of Squares 
$\text{SST} = \text{SSTR} + \text{SSE}$ (13.14) 

Multiple Comparison Procedures 

Test Statistic for Fisher’s LSD Procedure 
$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$ (13.16) 

Fisher’s LSD 
$\text{LSD} = t_{\alpha/2} \sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$ (13.17) 

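Randomized Block Design 

Total Sum of Squares 
$\text{SST} = \sum_{i=1}^{b} \sum_{j=1}^{k} (x_{ij} - \bar{\bar{x}})^2$ (13.22) 

Sum of Squares Due to Treatments 
$\text{SSTR} = b \sum_{j=1}^{k} (\bar{x}_{\cdot j} - \bar{\bar{x}})^2$ (13.23) 

Sum of Squares Due to Blocks 
$\text{SSBL} = k \sum_{i=1}^{b} (\bar{x}_{i \cdot} - \bar{\bar{x}})^2$ (13.24) 

Sum of Squares Due to Error 
$\text{SSE} = \text{SST} - \text{SSTR} - \text{SSBL}$ (13.25) 
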
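Factorial Experiment 

Total Sum of Squares 
$\text{SST} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{r} (x_{ijk} - \bar{\bar{x}})^2$ (13.27) 

Sum of Squares for Factor A 
$\text{SSA} = br \sum_{i=1}^{a} (\bar{x}_{i \cdot} - \bar{\bar{x}})^2$ (13.28) 

Sum of Squares for Factor B 
$\text{SSB} = ar \sum_{j=1}^{b} (\bar{x}_{\cdot j} - \bar{\bar{x}})^2$ (13.29) 

Sum of Squares for Interaction 
$\text{SSAB} = r \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{x}_{ij} - \bar{x}_{i \cdot} - \bar{x}_{\cdot j} + \bar{\bar{x}})^2$ (13.30) 

Sum of Squares for Error 
$\text{SSE} = \text{SST} - \text{SSA} - \text{SSB} - \text{SSAB}$ (13.31) 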
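
The partitioning of the total sum of squares in a two-factor factorial experiment can be sketched in Python as follows. The function name factorial_sums_of_squares, the nested-list layout, and the data values are assumptions made for this illustration only; the arithmetic follows the sums of squares for factor A, factor B, interaction, and error.

```python
def factorial_sums_of_squares(data):
    """Partition SST for a two-factor factorial experiment.

    data[i][j] is the list of r replications for level i of factor A and
    level j of factor B (a hypothetical layout chosen for this sketch).
    """
    a, b, r = len(data), len(data[0]), len(data[0][0])
    grand = sum(x for row in data for cell in row for x in cell) / (a * b * r)

    # Means for each treatment combination, each level of A, each level of B
    cell_mean = [[sum(cell) / r for cell in row] for row in data]
    a_mean = [sum(cell_mean[i]) / b for i in range(a)]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]

    sst = sum((x - grand) ** 2 for row in data for cell in row for x in cell)
    ssa = b * r * sum((m - grand) ** 2 for m in a_mean)
    ssb = a * r * sum((m - grand) ** 2 for m in b_mean)
    ssab = r * sum(
        (cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    sse = sst - ssa - ssb - ssab  # error is what remains after A, B, and AB
    return {"SST": sst, "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse}


# Made-up data: 2 levels of A, 2 levels of B, 2 replications per cell
data = [
    [[1, 3], [5, 7]],
    [[9, 11], [5, 7]],
]
ss = factorial_sums_of_squares(data)
print(ss)
```

With these made-up values the level means of factor B are equal, so SSB is zero, while the cell means show both a factor A effect and an interaction effect.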


SUPPLEMENTARY EXERCISES 


34. Paper Towel Absorption. In a completely randomized experimental design, three 
brands of paper towels were tested for their ability to absorb water. Equal-size towels 
were used, with four sections of towels tested per brand. The absorbency rating data 
follow. At a .05 level of significance, does there appear to be a difference in the ability 
of the brands to absorb water? 


        Brand 
  x       y       z 
 91      99      83 
100      96      88 
 88      94      89 
 89      99      76 


35. Job Satisfaction. A study reported in the Journal of Small Business Management con- 
cluded that self-employed individuals do not experience higher job satisfaction than 
individuals who are not self-employed. In this study, job satisfaction is measured using 
18 items, each of which is rated using a Likert-type scale with 1—5 response options 
ranging from strong agreement to strong disagreement. A higher score on this scale 
indicates a higher degree of job satisfaction. The sum of the ratings for the 18 items, 
ranging from 18 to 90, is used as the measure of job satisfaction. Suppose that this 
approach was used to measure the job satisfaction for lawyers, physical therapists, 
cabinetmakers, and systems analysts. The results obtained for a sample of 10 individu- 
als from each profession follow. 


DATA file: SatisJob 

Lawyer    Physical Therapist    Cabinetmaker    Systems Analyst 
  44              55                 54               44 
  42              78                 65               73 
  74              80                 79               71 
  42              86                 69               60 
  53              60                 79               64 
  50              59                 64               66 
  45              62                 59               41 
  48              52                 78               55 
  64              55                 84               76 
  38              50                 60               62 

At the α = .05 level of significance, test for any difference in the job satisfaction 
among the four professions. 

36. Monitoring Air Pollution. The U.S. Environmental Protection Agency (EPA) 
monitors levels of pollutants in the air for cities across the country. Ozone pollution 
levels are measured using a 500-point scale; lower scores indicate little health risk, 
and higher scores indicate greater health risk. The following data show the peak levels 
of ozone pollution in four cities (Birmingham, Alabama; Memphis, Tennessee; Little 
Rock, Arkansas; and Jackson, Mississippi) for 10 dates. 

DATA file: OzoneLevels 

                                    City 
Date      Birmingham AL    Memphis TN    Little Rock AR    Jackson MS 
Jan 9          18              20              18              14 
Jan 17         23              31              22              30 
Jan 18         19              25              22              21 
Jan 31         29              36              28              35 
Feb 1          27              31              28              24 
Feb 6          26              Sil             Si              25 
Feb 14         3]              24              9               25 
Feb 17         Sil             Sil             28              28 
Feb 20         33              35              35              34 
Feb 29         20              42              42              21 

Use α = .05 to test for any significant difference in the mean peak ozone levels among 
the four cities. 

37. College Attendance. The following data show the percentage of 17- to 24-year-olds 
who are attending college in several metropolitan statistical areas in four geographic 
regions of the United States (U.S. Census Bureau website). 


DATA file: CollegeRates 

Northeast Midwest South West 
28.6 36.7 59.9 16.4 
SY) 33.4 37 2 335 
Sieg) 22.8 28.0 223 
46.3 43.8 41.1 12.4 
2325 32n Sow AST 
14.9 58.3 18.8 26.8 
36.8 Sill 30.3 573 
36.3 64.0 67.4 14.3 
377. 27.6 32.6 37.0 
58.4 55:5 30.0 28.1 

60.6 78.8 il 1725 
42.2 29.7 323 
74.7 29.8 52.4 
36.5 297 51:5 
28.7 34.0 254 
60.4 24.5 29.6 
58.2 54.2 27.6 
PX) 10) SH siS 
28.8 Ay ©) 228 
255 70.2 34.6 
USG) 22 3910 
36.8 307 3740 
28.4 30.8 33.8 
27.2 PAS 28.7 
318 Silos) 21.8 
56.8 38.2 
28.3 40.2 
83:3 35.4 
39.4 21.6 
392 3575 
26.1 
327 


Use α = .05 to test whether the mean percentage of 17- to 24-year-olds who are 
attending college is the same for the four geographic regions. 

38. Assembly Methods. Three different assembly methods have been proposed for a 
new product. A completely randomized experimental design was chosen to determine 
which assembly method results in the greatest number of parts produced per hour, and 
30 workers were randomly selected and assigned to use one of the proposed methods. 
The number of units produced by each worker follows. 

DATA file: Assembly 

        Method 
  A      B      C 
 27     29     99 
  T    100     94 
 93     29     87 
100     55     66 
 73     97     59 
 91     91     75 
100     85     84 
 86     73     72 
 22     90     88 
 95     83     86 

Use these data and test to see whether the mean number of parts produced is the same 
with each method. Use α = .05. 

39. Job Automation. A Pew Research study conducted in 2017 found that approximately 
75% of Americans believe that robots and computers might one day do many of 
the jobs currently done by people (Pew Research website, 
http://www.pewinternet.org/2017/10/04/americans-attitudes-toward-a-future-in-which-robots-and-computers-can-do-many-human-jobs/). 
Suppose we have the following data collected from 
nurses, tax auditors, and fast-food workers in which a higher score means the person 
feels his or her job is more likely to be automated. 

DATA file: JobAutomation 

Nurse    Tax Auditor    Fast-Food Worker 
4        5              5 
a        : 
5        6              7 
:        :              7 
3        7              4 
4        4              6 
5        6              5 
4        5              7 


a. Use a = .05 to test for differences in the belief that a person’s job is likely to be 
automated for the three professions. 

b. Use Fisher’s LSD procedure to compare the belief that a person’s job will be auto- 
mated for nurses and tax auditors. 

40. Fuel Efficiency of Gasoline Brands. A research firm tests the miles-per-gallon char- 
acteristics of three brands of gasoline. Because of different gasoline performance char- 
acteristics in different brands of automobiles, five brands of automobiles are selected 
and treated as blocks in the experiment; that is, each brand of automobile is tested with 
each type of gasoline. The results of the experiment (in miles per gallon) follow. 


                      Gasoline Brands 
                      I      II     III 
Automobiles    A      18     21     20 
               B      24     26     27 
               C      30     29     34 
               D      22     25     24 
               E      20     23     24 

a. At α = .05, is there a significant difference in the mean miles-per-gallon 
characteristics of the three brands of gasoline? 

b. Analyze the experimental data using the ANOVA procedure for completely ran- 
domized designs. Compare your findings with those obtained in part (a). What is 
the advantage of attempting to remove the block effect? 

41. Jimmy Kimmel Live! on ABC, The Tonight Show Starring Jimmy Fallon on NBC, and 
The Late Show with Stephen Colbert on CBS are three popular late-night talk shows. 
The following table shows the number of viewers in millions for a 10-week period 
during the spring for each of these shows (TV by the Numbers website). 


DATA file: TalkShows 

                     Jimmy Kimmel    The Tonight Show Starring    The Late Show with 
Week                 Live (ABC)      Jimmy Fallon (NBC)           Stephen Colbert (CBS) 
June 13-June 17      2G              3.24                         22m 
June 6-June 10       2.58            322                          2.05 
May 30-June 3        2.64            2.66                         2.08 
May 23-May 27        2.47            3.30                         2.07 
May 16-May 20        1.97            3.10                         2.31 
May 9-May 16         221             3.31                         2.45 
May 2-May 6          ON              820                          2S7 
April 25-April 29    2.24            315                          2.45 
April 18-April 22    2.10            27                           2.56 
April 11-April 15    2.21            3.24                         ZAG 


At the .05 level of significance, test for a difference in the mean number of viewers per 
week for the three late-night talk shows. 

42. Houston Astros Attendance. Major League Baseball franchises rely on attendance for a large 
share of their total revenue, and weekend games are particularly important. The follow- 
ing table shows the attendance for the Houston Astros for games played during seven 
weekend series for the first three months (April, May, and June) of the 2011 season 
(ESPN website). 

DATA file: HoustonAstros 

Opponent                 Friday     Saturday    Sunday 
Florida Marlins          41,042     25,421      22,299 
San Diego Padres         23,755     28,100      22,899 
Milwaukee Brewers        25,734     26,514      23,908 
New York Mets            28,791     31,140      28,406 
Arizona Diamondbacks     21,834     31,405      21,882 
Atlanta Braves           29,252     SOA         23,765 
Tampa Bay Rays           26,682     27,208      23,965 


At the .05 level of significance, test whether the mean attendance is the same for 
these three days. The Houston Astros are considering running a special promotion to 
increase attendance during one game of each weekend series during the second half of 
the season. Do these data suggest a particular day on which the Astros should schedule 
these promotions? 

43. Language Translation. A factorial experiment was designed to test for any significant 
differences in the time needed to perform English to foreign language translations with 
two computerized language translators. Because the type of language translated was 
also considered a significant factor, translations were made with both systems for three 
different languages: Spanish, French, and German. Use the following data for translation 
time in hours. 


Language 
Spanish French German 
System 1 2 1 12 
12 14 16 
System 2 2 es 1e 
10 16 22 


Test for any significant differences due to language translator, type of language, and 
interaction. Use α = .05. 

44. Defective Parts. A manufacturing company designed a factorial experiment to deter- 
mine whether the number of defective parts produced by two machines differed and 
if the number of defective parts produced also depended on whether the raw material 
needed by each machine was loaded manually or by an automatic feed system. The 
following data give the numbers of defective parts produced. Use α = .05 to test for 
any significant effect due to machine, loading system, and interaction. 


Loading System 


Manual Automatic 
Machine 1 30 30 
34 26 
Machine 2 au ae 
22 28 


CASE PROBLEM 1: WENTWORTH MEDICAL CENTER 


As part of a long-term study of individuals 65 years of age or older, sociologists and 
physicians at the Wentworth Medical Center in upstate New York investigated the rela- 
tionship between geographic location and depression. A sample of 60 individuals, all in 
reasonably good health, was selected; 20 individuals were residents of Florida, 20 were 
residents of New York, and 20 were residents of North Carolina. Each of the individuals 
sampled was given a standardized test to measure depression. The data collected follow; 
higher test scores indicate higher levels of depression. These data are contained in the 
file Medical1. 

A second part of the study considered the relationship between geographic location and 
depression for individuals 65 years of age or older who had a chronic health condition such 
as arthritis, hypertension, and/or heart ailment. A sample of 60 individuals with such con- 
ditions was identified. Again, 20 were residents of Florida, 20 were residents of New York, 
and 20 were residents of North Carolina. The levels of depression recorded for this study 
follow. These data are contained in the file Medical2. 

        Data from Medical1                       Data from Medical2 
Florida   New York   North Carolina      Florida   New York   North Carolina 
3 8 10 13 14 10 
if 11 7 12 2 12 
if 9 3 17 15 15 
3 7 5 17 12 18 
8 8 11 20 16 12 
8 7 8 ZAl 24 14 
8 8 4 16 18 17 
5 4 3 14 14 8 
5 133 Uf 13 15 14 
2 10 8 17 1 16 
6 6 8 12 20 18 
2 8 7 9 ft 17 
6 12 3 12 23 19 
6 8 9 15 19 15 
9 6 8 16 17 13 
7 8 12 15 14 14 
5 5 6 13 9) 11 
4 7 3 10 14 12 
7 7 8 11 13 13 
3 8 11 i7 ld 11 


Managerial Report 


1. Use descriptive statistics to summarize the data from the two studies. What are your 
preliminary observations about the depression scores? 

2. Use analysis of variance on both data sets. State the hypotheses being tested in each 
case. What are your conclusions? 

3. Use inferences about individual treatment means where appropriate. What are your 
conclusions? 


CASE PROBLEM 2: COMPENSATION FOR SALES PROFESSIONALS 


Suppose that a local chapter of sales professionals in the greater San Francisco area con- 
ducted a survey of its membership to study the relationship, if any, between the years of 
experience and salary for individuals employed in inside and outside sales positions. On 
the survey, respondents were asked to specify one of three levels of years of experience: 
low (1-10 years), medium (11-20 years), and high (21 or more years). A portion of the 
data obtained follows. The complete data set, consisting of 120 observations, is contained 
in the file SalesSalary. 


Observation    Salary $    Position    Experience 
1              53,938      Inside      Medium 
2              52,694      Inside      Medium 
3              70,515      Outside     Low 
4              52,031      Inside      Medium 
5              62,283      Outside     Low 
6              57,718      Inside      Low 
7              79,081      Outside     High 
8              48,621      Inside      Low 
9              72,835      Outside     High 
10             54,768      Inside      Medium 
.              .           .           . 
115            58,080      Inside      High 
116            78,702      Outside     Medium 
117            83,131      Outside     Medium 
118            57,788      Inside      High 
119            53,070      Inside      Medium 
120            60,259      Outside     Low 


Managerial Report 


1. Use descriptive statistics to summarize the data. 

2. Develop a 95% confidence interval estimate of the mean annual salary for all salesper- 
sons, regardless of years of experience and type of position. 

3. Develop a 95% confidence interval estimate of the mean salary for inside salespersons. 

4. Develop a 95% confidence interval estimate of the mean salary for outside salespersons. 

5. Use analysis of variance to test for any significant differences due to position. Use a 
.05 level of significance, and for now, ignore the effect of years of experience. 

6. Use analysis of variance to test for any significant differences due to years of experience. 
Use a .05 level of significance, and for now, ignore the effect of position. 

7. At the .05 level of significance, test for any significant differences due to position, years 
of experience, and interaction. 


CASE PROBLEM 3: TOURISTOPIA TRAVEL 


TourisTopia Travel (Triple T) is an online travel agency that specializes in trips to 
exotic locations around the world for groups of ten or more travelers. Triple T’s marketing 
manager has been working on a major revision of the homepage of Triple T’s website. 
The content for the homepage has been selected and the only remaining decisions involve 
the selection of the background color (white, green, or pink) and the type of font (Arial, 
Calibri, or Tahoma). 


Triple T’s IT group has designed prototype homepages featuring every combination of 
these background colors and fonts, and it has implemented computer code that will ran- 
domly direct each Triple T website visitor to one of these prototype homepages. For three 
weeks, the prototype homepage to which each visitor was directed and the amount of time 
in seconds spent at Triple T’s website during each visit were recorded. Ten visitors to each 
of the prototype homepages were then selected randomly; the complete data set for these 
visitors is available in the file TourisTopia. 

Triple T wants to use these data to determine if the time spent by visitors to Triple T’s 
website differs by background color or font. It would also like to know if the time spent 
by visitors to the Triple T website differs by different combinations of background color 
and font. 


Managerial Report 
Prepare a managerial report that addresses the following issues. 


1. Use descriptive statistics to summarize the data from Triple T’s study. Based on 
descriptive statistics, what are your preliminary conclusions about whether the time spent 
by visitors to the Triple T website differs by background color or font? What are your 
preliminary conclusions about whether time spent by visitors to the Triple T website 
differs by different combinations of background color and font? 

2. Has Triple T used an observational study or a controlled experiment? Explain. 

3. Use the data from Triple T’s study to test the hypothesis that the time spent by visitors 
to the Triple T website is equal for the three background colors. Include both factors 
and their interaction in the ANOVA model, and use α = .05. 

4. Use the data from Triple T’s study to test the hypothesis that the time spent by visitors 
to the Triple T website is equal for the three fonts. Include both factors and their 
interaction in the ANOVA model, and use α = .05. 

5. Use the data from Triple T’s study to test the hypothesis that time spent by visitors to 
the Triple T website is equal for the nine combinations of background color and font. 
Include both factors and their interaction in the ANOVA model, and use α = .05. 

6. Do the results of your analysis of the data provide evidence that the time spent by 
visitors to the Triple T website differs by background color, font, or combination of 
background color and font? What is your recommendation? 


Simple Linear Regression 


STATISTICS IN PRACTICE: WALMART.COM 

14.1 Simple Linear Regression Model 
Regression Model and Regression Equation 
Estimated Regression Equation 

14.2 
Using Excel to Construct a Scatter Diagram, Display the 
Estimated Regression Line, and Display the Estimated 
Regression Equation 

14.3 
Using Excel to Compute the Coefficient of Determination 
Correlation Coefficient 

14.4 

14.5 
Estimate of σ² 
t Test 
Confidence Interval for β₁ 
F Test 
Some Cautions About the Interpretation of 
Significance Tests 

14.6 
Interval Estimation 
Confidence Interval for the Mean Value of y 
Prediction Interval for an Individual Value of y 

14.7 
Using Excel's Regression Tool for the Armand's Pizza 
Parlors Example 
Interpretation of Estimated Regression Equation Output 


STATISTICS IN PRACTICE 


Walmart.com* 
BENTONVILLE, ARKANSAS 


With more than 245 million customers per week 
visiting its 11,000 stores across the globe, Walmart 
is the world's largest retailer. In 2000, in response to 
increasing online shopping, Walmart launched its 
Internet site Walmart.com. To serve its online custom- 
ers, Walmart created a network of distribution centers 
in the United States. One of these distribution centers, 
located in Carrollton, Georgia, was selected as the site 
for a study on how Walmart might better manage its 
packaging for online orders. 

In its peak shipping season (November to December), 
the Walmart distribution center in Carrollton ships over 
100,000 packages per day. The cost of fulfilling an 
order includes the material cost (the cost of the carton 
and packing material, that is, paper that fills the empty space 
when the product is smaller than the volume of the box), 
labor for handling, and the shipping cost. The study 
conducted in Carrollton, called the Carton-mix 
Optimization Study, had as its objective to determine the 
number and size of cartons to have on hand for shipping 
Walmart’s products to its online customers that would 
minimize raw material, labor, and shipping costs. 

*Based on S. Ahire, M. Malhotra, and J. Jensen, “Carton-Mix Optimization 
for Walmart.com Distribution Centers,” Interfaces, Vol. 45, No. 4, 
July-August 2015, pp. 341-357. 

A number of constraints limited the possibilities for 
the variety of cartons to have on hand. For example, 
management provided a minimum and maximum 
carton size. In addition, for boxes automatically con- 
structed (as opposed to manually built), a line would be 
limited to one size carton. 

As with any study of this type, data had to be col- 
lected to build a cost model that could then be used 
to find the optimal carton mix. The material costs for 
existing cartons were known, but what about the cost 
of a carton in a size that is not currently used? Based 
on price quotes from its carton suppliers for a variety 
of sizes, a simple linear regression model was used 
to estimate the cost of any size carton based on its 
volume. The following model was used to estimate the 
material cost of a carton in dollars per carton (y) based 
on the volume of the carton measured in cubic inches 
per carton (x): 


y = -0.11 + 0.0014x 


For example, for a carton with volume of 2800 cubic 
inches, we have -0.11 + 0.0014(2800) = 3.81. Therefore, 
the estimated material cost for a carton of 2800 cubic 
inches in volume is $3.81. 
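
The calculation above can be reproduced with a short Python sketch. The function name estimated_carton_cost is hypothetical; the coefficients are the ones given for the estimated cost model.

```python
def estimated_carton_cost(volume_cubic_inches):
    """Estimated material cost in dollars for a carton of the given volume."""
    return -0.11 + 0.0014 * volume_cubic_inches


cost = estimated_carton_cost(2800)
print(f"Estimated material cost: ${cost:.2f}")
```

Each additional cubic inch of volume adds $0.0014 to the estimated cost, which is the interpretation of the slope in this model.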

The simple linear regression model was embedded 
into an optimization algorithm in Microsoft Excel to 
provide Walmart managers a recommendation on the 
optimal mix of carton sizes to carry at its Carrollton 
distribution center. The optimization model, based 
on this regression, provided documented savings 
of $600,000 in its first year of implementation. The 
model was later deployed across Walmart’s other 
distribution centers with an estimated annual savings 
of $2 million. 

In this chapter, we will study how to estimate a 
simple linear regression model, that is, a linear model in 
which a single variable is used to estimate the value of 
another variable. 


The statistical methods used in studying the relationship between two variables were first 
employed by Sir Francis Galton (1822-1911). Galton was interested in studying the relationship 
between a father’s height and the son's height. Galton’s disciple, Karl Pearson (1857-1936), 
analyzed the relationship between the father’s height and the son's height for 1078 pairs 
of subjects. 


Managerial decisions often are based on the relationship between two or more variables. 
For example, after considering the relationship between advertising expenditures and 
sales, a marketing manager might attempt to predict sales for a given level of advertis- 
ing expenditures. In another case, a public utility might use the relationship between the 
daily high temperature and the demand for electricity to predict electricity usage on the 
basis of next month’s anticipated daily high temperatures. Sometimes a manager will rely 
on intuition to judge how two variables are related. However, if data can be obtained, a 
statistical procedure called regression analysis can be used to develop an equation show- 
ing how the variables are related. 

In regression terminology, the variable being predicted is called the dependent variable. 
The variable or variables being used to predict the value of the dependent variable are 
called the independent variables. For example, in analyzing the effect of advertising 
expenditures on sales, a marketing manager’s desire to predict sales would suggest making 
sales the dependent variable. Advertising expenditure would be the independent variable 
used to help predict sales. In statistical notation, y denotes the dependent variable and x 
denotes the independent variable. 

In this chapter we consider the simplest type of regression analysis involving one inde- 
pendent variable and one dependent variable in which the relationship between the vari- 
ables is approximated by a straight line. It is called simple linear regression. Regression 
analysis involving two or more independent variables is called multiple regression analysis; 
multiple regression and cases involving curvilinear relationships are covered in Chapters 15 
and 16. 


14.1 Simple Linear Regression Model 


Armand’s Pizza Parlors is a chain of Italian-food restaurants located in a five-state area. 
Armand’s most successful locations are near college campuses. The managers believe that 
quarterly sales for these restaurants (denoted by y) are related positively to the size of the 
student population (denoted by x); that is, restaurants near campuses with a large student 
population tend to generate more sales than those located near campuses with a small stu- 
dent population. Using regression analysis, we can develop an equation showing how the 
dependent variable y is related to the independent variable x. 


Regression Model and Regression Equation 


In the Armand’s Pizza Parlors example, the population consists of all the Armand’s 
restaurants. For every restaurant in the population, there is a value of x (student population) and 
a corresponding value of y (quarterly sales). The equation that describes how y is related 
to x and an error term is called the regression model. The regression model used in simple 
linear regression follows. 


SIMPLE LINEAR REGRESSION MODEL 
$y = \beta_0 + \beta_1 x + \epsilon$ (14.1) 


β₀ and β₁ are referred to as the parameters of the model, and ε (the Greek letter epsilon) is 
a random variable referred to as the error term. The error term accounts for the variability 
in y that cannot be explained by the linear relationship between x and y. 

The population of all Armand’s restaurants can also be viewed as a collection of 
subpopulations, one for each distinct value of x. For example, one subpopulation consists 
of all Armand’s restaurants located near college campuses with 8000 students; another 
subpopulation consists of all Armand’s restaurants located near college campuses with 
9000 students; and so on. Each subpopulation has a corresponding distribution of y values. 
Thus, a distribution of y values is associated with restaurants located near campuses 
with 8000 students; a distribution of y values is associated with restaurants located near 
campuses with 9000 students; and so on. Each distribution of y values has its own mean 
or expected value. The equation that describes how the expected value of y, denoted E(y), 
is related to x is called the regression equation. The regression equation for simple linear 
regression follows. 


SIMPLE LINEAR REGRESSION EQUATION 
$E(y) = \beta_0 + \beta_1 x$ (14.2) 


The graph of the simple linear regression equation is a straight line; β₀ is the y-intercept of 
the regression line, β₁ is the slope, and E(y) is the mean or expected value of y for a given 
value of x. 
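
To make the distinction between the regression model and the regression equation concrete, the following Python sketch simulates y values for one subpopulation (a single value of x) and compares their average with E(y). The parameter values, the normally distributed error term, and the function names are assumptions chosen only for illustration; in practice β₀ and β₁ are unknown.

```python
import random


def regression_mean(x, beta0, beta1):
    """E(y) for a given x under the simple linear regression equation."""
    return beta0 + beta1 * x


def simulate_y(x, beta0, beta1, sigma, rng):
    """One draw of y = beta0 + beta1*x + epsilon.

    epsilon is drawn from a normal distribution with mean 0 and standard
    deviation sigma (an assumption made for this sketch).
    """
    return regression_mean(x, beta0, beta1) + rng.gauss(0, sigma)


# Hypothetical parameter values, not estimates from any data set
beta0, beta1, sigma = 50.0, 4.0, 10.0
rng = random.Random(1)  # fixed seed so the run is repeatable

x = 8  # one subpopulation, defined by a single value of x
draws = [simulate_y(x, beta0, beta1, sigma, rng) for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)

print(f"E(y) at x = {x}: {regression_mean(x, beta0, beta1):.1f}")
print(f"Average of simulated y values: {sample_mean:.1f}")
```

Individual y values scatter around the line because of the error term, but their average over many draws stays close to E(y), the point on the regression line for that value of x.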

Examples of possible regression lines are shown in Figure 14.1. The regression line in 
Panel A shows that the mean value of y is related positively to x, with larger values of E(y) 
associated with larger values of x. The regression line in Panel B shows the mean value of y 
is related negatively to x, with smaller values of E(y) associated with larger values of x. 


The regression line in Panel C shows the case in which the mean value of y is not related to 
x; that is, the mean value of y is the same for every value of x. 


FIGURE 14.1 Possible Regression Lines in Simple Linear Regression 

[Figure: three panels, each showing a regression line with E(y) on the vertical axis. 
Panel A: Positive Linear Relationship (intercept β₀; slope β₁ is positive). 
Panel B: Negative Linear Relationship (intercept β₀; slope β₁ is negative). 
Panel C: No Relationship (intercept β₀; slope β₁ is 0).] 


The value of ŷ provides both a point estimate of E(y) for a given value of x and a 
prediction of an individual value of y for a given value of x. 

The estimation of β₀ and β₁ is a statistical process much like the estimation of μ 
discussed in Chapter 7. β₀ and β₁ are the unknown parameters of interest, and b₀ and b₁ 
are the sample statistics used to estimate the parameters. 


Estimated Regression Equation 


If the values of the population parameters β₀ and β₁ were known, we could use equation (14.2) 
to compute the mean value of y for a given value of x. In practice, the parameter 
values are not known and must be estimated using sample data. Sample statistics (denoted 
b₀ and b₁) are computed as estimates of the population parameters β₀ and β₁. Substituting 
the values of the sample statistics b₀ and b₁ for β₀ and β₁ in the regression equation, we 
obtain the estimated regression equation. The estimated regression equation for simple 
linear regression follows. 


ESTIMATED SIMPLE LINEAR REGRESSION EQUATION 


$$\hat{y} = b_0 + b_1 x \tag{14.3}$$


Figure 14.2 provides a summary of the estimation process for simple linear regression. 
The graph of the estimated simple linear regression equation is called the estimated 
regression line; b₀ is the y-intercept and b₁ is the slope. In the next section, we show how 
the least squares method can be used to compute the values of b₀ and b₁ in the estimated 
regression equation. 

In general, ŷ is the point estimator of E(y), the mean value of y for a given value of x. 
Thus, to estimate the mean or expected value of quarterly sales for all restaurants located 
near campuses with 10,000 students, Armand’s would substitute the value of 10,000 for 
x in equation (14.3). In some cases, however, Armand’s may be more interested in predicting 
sales for one particular restaurant. For example, suppose Armand’s would like to 
predict quarterly sales for the restaurant they are considering building near Talbot College, 
a school with 10,000 students. As it turns out, the best predictor of y for a given value of x 
is also provided by ŷ. Thus, to predict quarterly sales for the restaurant located near Talbot 
College, Armand’s would also substitute the value of 10,000 for x in equation (14.3). 


FIGURE 14.2 The Estimation Process in Simple Linear Regression 


NOTES + COMMENTS 


1. Regression analysis cannot be interpreted as a procedure for establishing a cause-and-effect 
relationship between variables. It can only indicate how or to what extent variables are 
associated with each other. Any conclusions about cause and effect must be based upon the 
judgment of those individuals most knowledgeable about the application. 

2. The regression equation in simple linear regression is E(y) = β₀ + β₁x. More advanced texts 
in regression analysis often write the regression equation as E(y|x) = β₀ + β₁x to emphasize 
that the regression equation provides the mean value of y for a given value of x. 


14.2 Least Squares Method 


The least squares method is a procedure for using sample data to find the estimated 
regression equation. To illustrate the least squares method, suppose data were collected 
from a sample of 10 Armand’s Pizza Parlor restaurants located near college campuses. For 
the ith observation or restaurant in the sample, xᵢ is the size of the student population (in 
thousands) and yᵢ is the quarterly sales (in thousands of dollars). The values of xᵢ and yᵢ for 
the 10 restaurants in the sample are summarized in Table 14.1. We see that restaurant 1, 
with x₁ = 2 and y₁ = 58, is near a campus with 2000 students and has quarterly sales of 
$58,000. Restaurant 2, with x₂ = 6 and y₂ = 105, is near a campus with 6000 students and 
has quarterly sales of $105,000. The largest sales value is for restaurant 10, which is near a 
campus with 26,000 students and has quarterly sales of $202,000. 

Margin note: In simple linear regression, each observation consists of two values: one for the 
independent variable and one for the dependent variable. 

Figure 14.3 is a scatter diagram of the data in Table 14.1. Student population is shown 
on the horizontal axis and quarterly sales is shown on the vertical axis. Scatter diagrams 
for regression analysis are constructed with the independent variable x on the horizontal 
axis and the dependent variable y on the vertical axis. The scatter diagram enables us to 
observe the data graphically and to draw preliminary conclusions about the possible rela- 
tionship between the variables. 


TABLE 14.1 Student Population and Quarterly Sales Data for 10 Armand's Pizza Parlors 

DATA file: Armand’s 

               Student               Quarterly 
Restaurant     Population (1000s)    Sales ($1000s) 
i              xᵢ                    yᵢ 
1              2                     58 
2              6                     105 
3              8                     88 
4              8                     118 
5              12                    117 
6              16                    137 
7              20                    157 
8              20                    169 
9              22                    149 
10             26                    202 


FIGURE 14.3 Scatter Diagram of Student Population and Quarterly Sales 


What preliminary conclusions can be drawn from Figure 14.3? Quarterly sales appear 
to be higher at campuses with larger student populations. In addition, for these data the 
relationship between the size of the student population and quarterly sales appears to be 
approximated by a straight line; indeed, a positive linear relationship is indicated between x 
and y. We therefore choose the simple linear regression model to represent the relationship 
between quarterly sales and student population. Given that choice, our next task is to use 
the sample data in Table 14.1 to determine the values of b₀ and b₁ in the estimated simple 
linear regression equation. For the ith restaurant, the estimated regression equation provides 


$$\hat{y}_i = b_0 + b_1 x_i \tag{14.4}$$

where 

ŷᵢ = predicted value of quarterly sales ($1000s) for the ith restaurant 
b₀ = y-intercept of the estimated regression line 
b₁ = slope of the estimated regression line 
xᵢ = size of the student population (1000s) for the ith restaurant 


With yᵢ denoting the observed (actual) sales for restaurant i and ŷᵢ in equation (14.4) 
representing the predicted value of sales for restaurant i, every restaurant in the sample will have 
an observed value of sales yᵢ and a predicted value of sales ŷᵢ. For the estimated regression 
line to provide a good fit to the data, we want the differences between the observed sales 
values and the predicted sales values to be small. 

The least squares method uses the sample data to provide the values of b₀ and b₁ that 
minimize the sum of the squares of the deviations between the observed values of the 
dependent variable yᵢ and the predicted values of the dependent variable ŷᵢ. The criterion for 
the least squares method is given by expression (14.5). 


LEAST SQUARES CRITERION 

$$\min \sum (y_i - \hat{y}_i)^2 \tag{14.5}$$

where 

yᵢ = observed value of the dependent variable for the ith observation 
ŷᵢ = predicted value of the dependent variable for the ith observation 

Margin note: Carl Friedrich Gauss (1777-1855) proposed the least squares method. 

Differential calculus can be used to show (see Appendix 14.1) that the values of b₀ and b₁ 
that minimize expression (14.5) can be found by using equations (14.6) and (14.7). 
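
As a rough illustration of the criterion in expression (14.5), the following Python sketch evaluates the sum of squared deviations for a few candidate lines using the Table 14.1 data. The function name `sum_squared_deviations` and the candidate intercept and slope pairs are assumptions chosen only for illustration; they do not come from the text.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def sum_squared_deviations(b0, b1, xs, ys):
    """Least squares criterion (14.5): sum of (y_i - y_hat_i)^2 for the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


# Hypothetical candidate lines (intercept, slope), picked only to compare fits.
candidates = [(50, 5), (60, 4), (60, 5), (70, 5)]
for b0, b1 in candidates:
    print(b0, b1, sum_squared_deviations(b0, b1, x, y))
```

Of these four candidates, the line with intercept 60 and slope 5 gives the smallest sum (1530), so it is the one the least squares criterion prefers among them.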


SLOPE AND y-INTERCEPT FOR THE ESTIMATED REGRESSION EQUATION¹ 

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{14.6}$$

$$b_0 = \bar{y} - b_1 \bar{x} \tag{14.7}$$

where 

xᵢ = value of the independent variable for the ith observation 
yᵢ = value of the dependent variable for the ith observation 
x̄ = mean value for the independent variable 
ȳ = mean value for the dependent variable 
n = total number of observations 

Margin note: In computing b₁ with a calculator, carry as many significant digits as possible in 
the intermediate calculations. We recommend carrying at least four significant digits. 
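
Equations (14.6) and (14.7) can be sketched in a few lines of Python. This is a minimal illustration, assuming the Table 14.1 data are held in two plain lists; the function name `least_squares` is hypothetical.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def least_squares(xs, ys):
    """Return (b0, b1) computed with equations (14.7) and (14.6)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of equation (14.6).
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sxy / sxx
    # Equation (14.7).
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(b0, b1)  # 60.0 5.0
```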


Some of the calculations necessary to develop the least squares estimated regression equation 
for Armand’s Pizza Parlors are shown in Table 14.2. With the sample of 10 restaurants, 
we have n = 10 observations. Because equations (14.6) and (14.7) require x̄ and ȳ, we begin 
the calculations by computing x̄ and ȳ. 

$$\bar{x} = \frac{\sum x_i}{n} = \frac{140}{10} = 14$$

$$\bar{y} = \frac{\sum y_i}{n} = \frac{1300}{10} = 130$$


Using equations (14.6) and (14.7) and the information in Table 14.2, we can compute the 
slope and intercept of the estimated regression equation for Armand’s Pizza Parlors. The 
calculation of the slope (b₁) proceeds as follows. 

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{2840}{568} = 5$$


¹An alternate formula for b₁ is 

$$b_1 = \frac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$$

This form of equation (14.6) is often recommended when using a calculator to compute b₁. 
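
The alternate formula in the footnote can be checked against the Table 14.1 data with a short Python sketch. The function name `slope_alternate` is an assumption made for this illustration; the formula itself is the raw-sums form of the slope, in which only running totals of x, y, xy, and x² are needed.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def slope_alternate(xs, ys):
    """Slope b1 from raw sums, without first subtracting the means."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


print(slope_alternate(x, y))  # 5.0
```

The numerator works out to 21,040 − 18,200 = 2840 and the denominator to 2528 − 1960 = 568, the same totals that appear in the deviation form of equation (14.6).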


TABLE 14.2 Calculations for the Least Squares Estimated Regression Equation for Armand's Pizza Parlors 

Restaurant i   xᵢ     yᵢ      xᵢ - x̄    yᵢ - ȳ    (xᵢ - x̄)(yᵢ - ȳ)    (xᵢ - x̄)² 
1              2      58      -12       -72       864                 144 
2              6      105     -8        -25       200                 64 
3              8      88      -6        -42       252                 36 
4              8      118     -6        -12       72                  36 
5              12     117     -2        -13       26                  4 
6              16     137     2         7         14                  4 
7              20     157     6         27        162                 36 
8              20     169     6         39        234                 36 
9              22     149     8         19        152                 64 
10             26     202     12        72        864                 144 
Totals         140    1300                        2840                568 
               Σxᵢ    Σyᵢ                         Σ(xᵢ - x̄)(yᵢ - ȳ)   Σ(xᵢ - x̄)² 


The calculation of the y-intercept (b₀) follows. 

$$b_0 = \bar{y} - b_1 \bar{x} = 130 - 5(14) = 60$$

Thus, the estimated regression equation is 

$$\hat{y} = 60 + 5x$$


Figure 14.4 shows the graph of this equation on the scatter diagram. 


FIGURE 14.4 Graph of the Estimated Regression Equation for Armand's Pizza Parlors 


The slope of the estimated regression equation (b₁ = 5) is positive, implying that as 
student population increases, sales increase. In fact, we can conclude (based on sales 
measured in $1000s and student population in 1000s) that an increase in the student population 
of 1000 is associated with an increase of $5000 in expected sales; that is, quarterly sales 
are expected to increase by $5 per student. 

If we believe the least squares estimated regression equation adequately describes the 
relationship between x and y, it would seem reasonable to use the estimated regression equation 
to predict the value of y for a given value of x. For example, if we wanted to predict quarterly 
sales for a restaurant to be located near a campus with 16,000 students, we would compute 

$$\hat{y} = 60 + 5(16) = 140$$

Hence, we would predict quarterly sales of $140,000 for this restaurant. In the following 
sections we will discuss methods for assessing the appropriateness of using the estimated 
regression equation for estimation and prediction. 

Margin note: Using the estimated regression equation to make predictions outside the range of 
the values of the independent variable should be done with caution because outside that range 
we cannot be sure that the same relationship is valid. 
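
The prediction step, together with the caution about predicting outside the observed range, can be sketched in Python as follows. The function name `predict_sales`, the range constants, and the 40,000-student example are assumptions introduced for illustration; only the coefficients 60 and 5 and the observed range of 2 to 26 (thousands of students) come from the text and Table 14.1.

```python
# Smallest and largest student populations (1000s) observed in Table 14.1.
X_MIN, X_MAX = 2, 26


def predict_sales(x, b0=60, b1=5):
    """Return (predicted quarterly sales in $1000s, whether x is inside the observed range)."""
    return b0 + b1 * x, X_MIN <= x <= X_MAX


print(predict_sales(16))  # (140, True)
print(predict_sales(40))  # (260, False): hypothetical campus outside the sampled range
```

A `False` flag does not mean the prediction is wrong; it only signals that the sample offers no evidence about whether the same linear relationship holds at that value of x.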


Using Excel to Construct a Scatter Diagram, Display the Estimated Regression Line, 
and Display the Estimated Regression Equation 


We can use Excel to construct a scatter diagram, display the estimated regression line, and 
display the estimated regression equation for the Armand's Pizza Parlors data appearing in 
Table 14.1. Refer to Figure 14.5 as we describe the tasks involved. 


Enter/Access Data: Open the file Armand’s. The data are in cells B2:C11 and labels 
appear in column A and cells B1:C1. 


Apply Tools: The following steps describe how to construct a scatter diagram from the 
data in the worksheet. 


Step 1. Select cells B2:C11 
Step 2. Click the Insert tab on the Ribbon 
Step 3. In the Charts group, click the Insert Scatter (X,Y) or Bubble Chart 
Step 4. When the list of scatter diagram subtypes appears: 
Click Scatter (the chart in the upper left corner) 


Editing Options: You can edit the scatter diagram to add a more descriptive chart title, 
add axis titles, and display the trendline and estimated regression equation. For instance, 
suppose you would like to use “Armand’s Pizza Parlors” as the chart title and insert 
“Student Population (1000s)” for the horizontal axis title and “Quarterly Sales ($1000s)” 
for the vertical axis title. 


Step 1. Click the Chart Title and replace it with Armand’s Pizza Parlors 
Step 2. Click the Chart Elements button + (located next to the top right corner of 
the chart) 
Step 3. When the list of chart elements appears: 
Click Axis Titles (creates placeholders for the axis titles) 
Click Gridlines (to deselect the Gridlines option) 
Click Trendline 
Step 4. Click the horizontal Axis Title and replace it with Student Population (1000s) 
Step 5. Click the Vertical (Value) Axis Title and replace it with Quarterly Sales ($1000s) 
Step 6. To change the trendline from a dashed line to a solid line, right-click on the 
trendline and select the Format Trendline option 
Step 7. When the Format Trendline dialog box appears: 
Scroll down and select Display Equation on chart 
Click the Fill & Line button 
In the Dash type box, select Solid 
Close the Format Trendline dialog box 


FIGURE 14.5 Scatter Diagram, Estimated Regression Line, and the Estimated Regression Equation for Armand’s Pizza Parlors 


The worksheet displayed in Figure 14.5 shows the scatter diagram, the estimated regression 
line, and the estimated regression equation. 
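
For readers who want to check the equation shown on the chart outside Excel, the following Python sketch recomputes the coefficients from the Table 14.1 data and formats them as a text label. The function name `trendline_label` and the slope-first layout of the label are assumptions made for this illustration, not a description of how Excel builds its label.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def trendline_label(xs, ys):
    """Fit a least squares line and return it as a slope-first text label."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sum(
        (xi - x_bar) ** 2 for xi in xs
    )
    b0 = y_bar - b1 * x_bar
    sign = "+" if b0 >= 0 else "-"
    return f"y = {b1:g}x {sign} {abs(b0):g}"


print(trendline_label(x, y))  # y = 5x + 60
```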


NOTE + COMMENT 


The least squares method provides an estimated regression equation that minimizes the sum of 
squared deviations between the observed values of the dependent variable yᵢ and the predicted 
values of the dependent variable ŷᵢ. This least squares criterion is used to choose the equation 
that provides the best fit. If some other criterion were used, such as minimizing the sum of the 
absolute deviations between yᵢ and ŷᵢ, a different equation would be obtained. In practice, the 
least squares method is the most widely used. 
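
The point that a different criterion can favor a different line can be illustrated with a small Python sketch on the Table 14.1 data. The alternative line with intercept 57 and slope 5 is a hypothetical choice made for this comparison; it is not claimed to be the line that minimizes the sum of absolute deviations, only one that scores better than the least squares line on that criterion.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def sum_squared(b0, b1):
    """Sum of squared deviations between observed and predicted y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def sum_absolute(b0, b1):
    """Sum of absolute deviations between observed and predicted y."""
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))


least_squares_line = (60, 5)   # from the text
alternative_line = (57, 5)     # hypothetical line, for comparison only

for name, (b0, b1) in [("least squares", least_squares_line),
                       ("alternative", alternative_line)]:
    print(name, sum_squared(b0, b1), sum_absolute(b0, b1))
```

On these data the least squares line has the smaller sum of squared deviations (1530 versus 1620), while the alternative line has the smaller sum of absolute deviations (102 versus 108).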
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EXERCISES 


Methods 
1. Given are five observations for two variables, x and y. 


xᵢ | 1  2  3  4  5 

yᵢ | 3  7  5  1  4 


a. Develop a scatter diagram for these data. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Try to approximate the relationship between x and y by drawing a straight line 
through the data. 

d. Develop the estimated regression equation by computing the values of b₀ and b₁ 
using equations (14.6) and (14.7). 

e. Use the estimated regression equation to predict the value of y when x = 4. 

2. Given are five observations for two variables, x and y. 


xᵢ | 3  12  6  20  14 

yᵢ | 55  40  55  10  15 

a. Develop a scatter diagram for these data. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Try to approximate the relationship between x and y by drawing a straight line 
through the data. 

d. Develop the estimated regression equation by computing the values of b₀ and b₁ 
using equations (14.6) and (14.7). 

e. Use the estimated regression equation to predict the value of y when x = 10. 

3. Given are five observations collected in a regression study on two variables. 


xᵢ | 2  6  9  13  20 

yᵢ | 7  18  9  26  23 
a. Develop a scatter diagram for these data. 
b. Develop the estimated regression equation for these data. 
c. Use the estimated regression equation to predict the value of y when x = 6. 


Applications 
4. Retail and Trade: Female Managers. The following data give the percentage of 
women working in five companies in the retail and trade industry. The percentage of 
management jobs held by women in each company is also shown. 


% Working    | 67  45  B   54  61 
% Management | 49  21  65  47  33 


a. Develop a scatter diagram for these data with the percentage of women working in 
the company as the independent variable. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Try to approximate the relationship between the percentage of women working in the 
company and the percentage of management jobs held by women in that company. 

d. Develop the estimated regression equation by computing the values of b₀ and b₁. 

e. Predict the percentage of management jobs held by women in a company that has 
60% women employees. 

5. Production Line Speed and Quality Control. Brawdy Plastics, Inc., produces plastic 
seat belt retainers for General Motors at their plant in Buffalo, New York. After final 
assembly and painting, the parts are placed on a conveyor belt that moves the parts 
past a final inspection station. How fast the parts move past the final inspection station 
depends upon the line speed of the conveyor belt (feet per minute). Although faster 
line speeds are desirable, management is concerned that increasing the line speed too 
much may not provide enough time for inspectors to identify which parts are actually 
defective. To test this theory, Brawdy Plastics conducted an experiment in which the 
same batch of parts, with a known number of defective parts, was inspected using a 
variety of line speeds. The following data were collected. 


Line Speed    Number of Defective Parts Found 
20            23 
20            21 
30            19 
30            16 
40            15 
40            7 
50            14 
50            sla 


a. Develop a scatter diagram with the line speed as the independent variable. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Use the least squares method to develop the estimated regression equation. 

d. Predict the number of defective parts found for a line speed of 25 feet per minute. 

6. Passing and Winning in the NFL. The National Football League (NFL) records a 
variety of performance data for individuals and teams. To investigate the importance of 
passing on the percentage of games won by a team, the following data show the aver- 
age number of passing yards per attempt (Yds/Att) and the percentage of games won 
(WinPct) for a random sample of 10 NFL teams for the 2011 season (NFL website). 


DATA file: NFLPassing 

Team                     Yds/Att    WinPct 
Arizona Cardinals        6.5        50 
Atlanta Falcons          TaN        63 
Carolina Panthers        7.4        38 
Chicago Bears            6.4        50 
Dallas Cowboys           7.4        50 
New England Patriots     8.3        81 
Philadelphia Eagles      7.4        50 
Seattle Seahawks         6.1        44 
St. Louis Rams           5.2        13 
Tampa Bay Buccaneers     6.2        25 


a. Develop a scatter diagram with the number of passing yards per attempt on the 
horizontal axis and the percentage of games won on the vertical axis. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Develop the estimated regression equation that could be used to predict the percent- 
age of games won given the average number of passing yards per attempt. 

d. Provide an interpretation for the slope of the estimated regression equation. 

e. For the 2011 season, the average number of passing yards per attempt for the Kansas 
City Chiefs was 6.2. Use the estimated regression equation developed in part (c) 
to predict the percentage of games won by the Kansas City Chiefs. (Note: For the 
2011 season the Kansas City Chiefs’ record was 7 wins and 9 losses.) Compare your 
prediction to the actual percentage of games won by the Kansas City Chiefs. 

7. Sales Experience and Performance. A sales manager collected the following data on 
annual sales for new customer accounts and the number of years of experience for a 
sample of 10 salespersons. 

DATA file: Sales 

                Years of       Annual Sales 
Salesperson     Experience     ($1000s) 
1               1              80 
2               3              97 
3               4              92 
4               4              102 
5               6              103 
6               8              Wi 
7               10             He 
8               10             123 
9               11             7 
10              18             136 

a. Develop a scatter diagram for these data with years of experience as the independent 
variable. 
b. Develop an estimated regression equation that can be used to predict annual sales 
given the years of experience. 
c. Use the estimated regression equation to predict annual sales for a salesperson with 
9 years of experience. 

8. Broker Satisfaction. The American Association of Individual Investors (AAII) 
On-Line Discount Broker Survey polls members on their experiences with discount 
brokers. As part of the survey, members were asked to rate the quality of the speed of 
execution with their broker as well as provide an overall satisfaction rating for electronic 
trades. Possible responses (scores) were no opinion (0), unsatisfied (1), somewhat 
satisfied (2), satisfied (3), and very satisfied (4). For each broker, summary scores 
were computed by calculating a weighted average of the scores provided by each 
respondent. A portion of the survey results follows (AAII website, February 7, 2012). 

DATA file: BrokerRatings 

Brokerage                       Speed    Satisfaction 
Scottrade, Inc.                 3.4      3.5 
Charles Schwab                  Sie}     3.4 
Fidelity Brokerage Services     3.4      3.9 
TD Ameritrade                   3.6      3.7 
E*Trade Financial               3.2      2.9 
Vanguard Brokerage Services     3.8      2.8 
USAA Brokerage Services         3.8      3.6 
Thinkorswim                     2.6      2.6 
Wells Fargo Investments         2.7      2.3 
Interactive Brokers             4.0      4.0 
Zecco.com                       2.5      2.5 

a. Develop a scatter diagram for these data with the speed of execution as the independent 
variable. 
b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 
c. Develop the least squares estimated regression equation. 
d. Provide an interpretation for the slope of the estimated regression equation. 
e. Suppose Zecco.com developed new software to increase their speed of execution rating. 
If the new software is able to increase their speed of execution rating from the current 
value of 2.5 to the average speed of execution rating for the other 10 brokerage firms 
that were surveyed, what value would you predict for the overall satisfaction rating? 
9. Estimating Landscaping Expenditures. David’s Landscaping has collected data on 
home values (in thousands of $) and expenditures (in thousands of $) on landscaping 
with the hope of developing a predictive model to help marketing to potential new 
clients. Data for 14 households may be found in the file Landscape. 

DATA file: Landscape 

a. Develop a scatter diagram with home value as the independent variable. 
b. What does the scatter plot developed in part (a) indicate about the relationship 
between the two variables? 
c. Use the least squares method to develop the estimated regression equation. 
d. For every additional $1000 in home value, estimate how much additional will be 
spent on landscaping. 
e. Use the equation estimated in part (c) to predict the landscaping expenditures for 
a home valued at $575,000. 

10. Age and the Price of Wine. For a particular red wine, the following data show the 
auction price for a 750-milliliter bottle and the age of the wine as of June 2016 
(WineX website). 

DATA file: WinePrices 

Age (years)    Price ($) 
36             256 
20             142 
29             212 
33             E 
41             331 
27             Ws 
30             209 
45             297 
34             237 
22             182 
a. Develop a scatter diagram for these data with age as the independent variable. 
b. What does the scatter diagram developed in part (a) indicate about the relationship 
between age and price? 
c. Develop the least squares estimated regression equation. 
d. Provide an interpretation for the slope of the estimated equation. 
11. Laptop Ratings. To help consumers in purchasing a laptop computer, Consumer 
Reports calculates an overall test score for each computer tested based upon rating 
factors such as ergonomics, portability, performance, display, and battery life. Higher 
overall scores indicate better test results. The following data show the average retail price 
and the overall score for ten 13-inch models (Consumer Reports website). 
DATA file: Computer 

                                        Price    Overall 
Brand & Model                           ($)      Score 
Samsung Ultrabook NP900X3C-A01US        1250     83 
Apple MacBook Air MC965LL/A             1300     83 
Apple MacBook Air MD231LL/A             1200     82 
HP ENVY 13-2050nr Spectre XT            950      FY 
Sony VAIO SVS13112FXB                   800      77 
Acer Aspire S5-391-9880 Ultrabook       1200     74 
Apple MacBook Pro MD101LL/A             1200     74 
Apple MacBook Pro MD313LL/A             1000     73 
Dell Inspiron 113Z-6591SLV              700      67 
Samsung NP535U3C-A01US                  600      63 

a. Develop a scatter diagram with price as the independent variable. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Use the least squares method to develop the estimated regression equation. 

d. Provide an interpretation of the slope of the estimated regression equation. 

e. Another laptop that Consumer Reports tested is the Acer Aspire S3-951-6646 Ul- 
trabook; the price for this laptop was $700. Predict the overall score for this laptop 
using the estimated regression equation developed in part (c). 

12. Stock Beta. In June 2016, Yahoo Finance reported the beta value for Coca-Cola was 
.82 (Yahoo Finance website). Betas for individual stocks are determined by simple linear 
regression. The dependent variable is the total return for the stock, and the independent 
variable is the total return for the stock market, such as the return of the S&P 500. 
The slope of this regression equation is referred to as the stock’s beta. Many financial 
analysts prefer to measure the risk of a stock by computing the stock’s beta value. 

Margin note: For more discussion and practice estimating stock betas, see Case 1 at the 
end of this chapter. 

The data contained in the file CocaCola show the monthly percentage returns for 
the S&P 500 and the Coca-Cola Company for August 2015 to May 2016. 


DATA file: CocaCola 

             S&P 500      Coca-Cola 
Month        % Return     % Return 
August       =            3 
September    8            6 
October      0            1 
November     =2           1 
December     5            0) 
January      0            0 
February     7            8 
March        0            =) 
April        2            o 
May          5            =" 
a. Develop a scatter diagram with the S&P % Return as the independent variable. 
b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the returns of the S&P 500 and those of the Coca-Cola Company? 
c. Develop the least squares estimated regression equation. 
d. Provide an interpretation for the slope of the estimated equation (i.e., the beta). 
e. Is your beta estimate close to .82? If not, why might your estimate be different? 
13. Auditing Itemized Tax Deductions. A large city hospital conducted a study to inves- 
tigate the relationship between the number of unauthorized days that employees are 
absent per year and the distance (miles) between home and work for the employees. A 
sample of 10 employees was selected and the following data were collected. 
DATA file: Absent 

Distance to Work    Number of Days 
(miles)             Absent 
1                   8 
3                   5 
4                   8 
6                   7 
8                   6 
10                  3 
12                  5 
14                  2 
14                  4 
18                  2 


Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). 
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 


a. Develop a scatter diagram for these data. Does a linear relationship appear reasonable? 
Explain. 
b. Develop the least squares estimated regression equation that relates the distance to 
work to the number of days absent. 
c. Predict the number of days absent for an employee that lives 5 miles from the 
hospital. 

14. GPS Navigators. Consumer Reports conducted extensive tests of GPS-based 
navigators and developed an overall rating based on factors such as ease of use, 
driver information, display, and battery life. The following data show the price 
and rating for a sample of 20 GPS units with a 4.3-inch screen that Consumer Reports 
tested (Consumer Reports website). 

DATA file: GPS 

Brand and Model 
Garmin Nuvi 3490LMT 
Garmin Nuvi 3450 
Garmin Nuvi 3790T 
Garmin Nuvi 3790LMT 
Garmin Nuvi 3750 
Garmin Nuvi 2475LT 
Garmin Nuvi 2455LT 
Garmin Nuvi 2370LT 
Garmin Nuvi 2360LT 
Garmin Nuvi 2360LMT 
Garmin Nuvi 755T 
Motorola Motonab TN565t 
Motorola Motonab TN555 
Garmin Nuvi 1350T 
Garmin Nuvi 1350LMT 
Garmin Nuvi 2300 
Garmin Nuvi 1350 
Tom Tom VIA 1435T 
Garmin Nuvi 1300 
Garmin Nuvi 1300LM 

Price ($): 400, 330, 350, 400, 250, 230, 160, 270, 250, 220, 260, 200, 200, 150 

Rating: 82, 80, UI, HY, 74, 74, 73, Til, vA, Ta, 70, 68, 67, 65, 65, 65, 64, 62, 62, 62 

a. Develop a scatter diagram with price as the independent variable. 
b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 
c. Use the least squares method to develop the estimated regression equation. 
d. Predict the rating for a GPS system with a 4.3-inch screen that has a price of $200. 


14.3 Coefficient of Determination 


For the Armand’s Pizza Parlors example, we developed the estimated regression equation 
ŷ = 60 + 5x to approximate the linear relationship between the size of the student 
population x and quarterly sales y. A question now is: How well does the estimated regression 
equation fit the data? In this section, we show that the coefficient of determination 
provides a measure of the goodness of fit for the estimated regression equation. 

For the ith observation, the difference between the observed value of the dependent 
variable, yᵢ, and the predicted value of the dependent variable, ŷᵢ, is called the ith residual. 
The ith residual represents the error in using ŷᵢ to estimate yᵢ. Thus, for the ith observation, 
the residual is yᵢ − ŷᵢ. The sum of squares of these residuals or errors is the quantity that 
is minimized by the least squares method. This quantity, also known as the sum of squares 
due to error, is denoted by SSE. 


TABLE 14.3 Calculation of SSE for Armand’s Pizza Parlors 

               xᵢ = Student    yᵢ = Quarterly    Predicted                      Squared 
               Population      Sales             Sales              Error       Error 
Restaurant i   (1000s)         ($1000s)          ŷᵢ = 60 + 5xᵢ      yᵢ - ŷᵢ     (yᵢ - ŷᵢ)² 
1              2               58                70                 -12         144 
2              6               105               90                 15          225 
3              8               88                100                -12         144 
4              8               118               100                18          324 
5              12              117               120                -3          9 
6              16              137               140                -3          9 
7              20              157               160                -3          9 
8              20              169               160                9           81 
9              22              149               170                -21         441 
10             26              202               190                12          144 
                                                                    SSE = 1530 


SUM OF SQUARES DUE TO ERROR 


$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 \tag{14.8}$$


The value of SSE is a measure of the error in using the estimated regression equation to 
predict the values of the dependent variable in the sample. 

In Table 14.3 we show the calculations required to compute the sum of squares due 
to error for the Armand’s Pizza Parlors example. For instance, for restaurant 1 the values 
of the independent and dependent variables are x₁ = 2 and y₁ = 58. Using the estimated 
regression equation, we find that the predicted value of quarterly sales for restaurant 1 is 
ŷ₁ = 60 + 5(2) = 70. Thus, the error in using ŷ₁ to predict y₁ for restaurant 1 is y₁ − ŷ₁ = 
58 − 70 = −12. The squared error, (−12)² = 144, is shown in the last column of Table 14.3. 
After computing and squaring the residuals for each restaurant in the sample, we sum them to 
obtain SSE = 1530. Thus, SSE = 1530 measures the error in using the estimated regression 
equation ŷ = 60 + 5x to predict sales. 
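
The residual and SSE calculation described above can be reproduced with a short Python sketch. The variable names are assumptions for illustration; the data are those of Table 14.1 and the coefficients 60 and 5 are the ones derived in the text.

```python
# Student population (1000s) and quarterly sales ($1000s) from Table 14.1.
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

# Predicted sales from the estimated regression equation y_hat = 60 + 5x.
predicted = [60 + 5 * xi for xi in x]

# Residual for each restaurant: observed minus predicted.
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

# Sum of squares due to error, equation (14.8).
sse = sum(r ** 2 for r in residuals)

print(residuals)  # [-12, 15, -12, 18, -3, -3, -3, 9, -21, 12]
print(sse)        # 1530
```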

Now suppose we are asked to develop an estimate of quarterly sales without knowledge 
of the size of the student population. Without knowledge of any related variables, we would 
use the sample mean as an estimate of quarterly sales at any given restaurant. Table 14.2 
showed that for the sales data, Σyᵢ = 1300. Hence, the mean value of quarterly sales for 
the sample of 10 Armand’s restaurants is ȳ = Σyᵢ/n = 1300/10 = 130. In Table 14.4 we 
show the sum of squared deviations obtained by using the sample mean ȳ = 130 to predict 
the value of quarterly sales for each restaurant in the sample. For the ith restaurant in the 
sample, the difference yᵢ − ȳ provides a measure of the error involved in using ȳ to predict 
sales. The corresponding sum of squares, called the total sum of squares, is denoted SST. 


TOTAL SUM OF SQUARES 
SST = XO; — y} (14.9) 


The sum at the bottom of the last column in Table 14.4 is the total sum of squares for
Armand’s Pizza Parlors; it is SST = 15,730.

TABLE 14.4 Computation of the Total Sum of Squares for Armand's Pizza Parlors

               $x_i$ = Student   $y_i$ = Quarterly                     Squared
Restaurant i   Population        Sales               Deviation         Deviation
               (1000s)           ($1000s)            $y_i - \bar{y}$   $(y_i - \bar{y})^2$
 1              2                 58                 -72               5184
 2              6                105                 -25                625
 3              8                 88                 -42               1764
 4              8                118                 -12                144
 5             12                117                 -13                169
 6             16                137                   7                 49
 7             20                157                  27                729
 8             20                169                  39               1521
 9             22                149                  19                361
10             26                202                  72               5184
                                                     SST = 15,730

In Figure 14.6 we show the estimated regression line $\hat{y} = 60 + 5x$ and the line
corresponding to $\bar{y} = 130$. Note that the points cluster more closely around the estimated
regression line than they do about the line $\bar{y} = 130$. For example, for the 10th restaurant
in the sample we see that the error is much larger when $\bar{y} = 130$ is used to predict $y_{10}$ than
when $\hat{y}_{10} = 60 + 5(26) = 190$ is used. We can think of SST as a measure of how well the
observations cluster about the $\bar{y}$ line and SSE as a measure of how well the observations
cluster about the $\hat{y}$ line.

Note: With SST = 15,730 and SSE = 1530, the estimated regression line provides a
much better fit to the data than the line $y = \bar{y}$.
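The total sum of squares can be checked the same way. This sketch, again in Python with illustrative variable names, uses the sample mean as the only predictor and reproduces the deviation columns of Table 14.4.

```python
# Quarterly sales ($1000s) for the 10 restaurants in Table 14.4.
quarterly_sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

n = len(quarterly_sales)
y_bar = sum(quarterly_sales) / n          # sample mean of sales

# Deviation of each observation from the sample mean.
deviations = [y - y_bar for y in quarterly_sales]

# SST is the sum of the squared deviations, equation (14.9).
sst = sum(d ** 2 for d in deviations)

print(y_bar)  # 130.0
print(sst)    # 15730.0
```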


FIGURE 14.6 Deviations About the Estimated Regression Line and the Line $y = \bar{y}$ for Armand's Pizza Parlors

To measure how much the $\hat{y}$ values on the estimated regression line deviate from $\bar{y}$,
another sum of squares is computed. This sum of squares, called the sum of squares due to
regression, is denoted SSR.


SUM OF SQUARES DUE TO REGRESSION

$$\text{SSR} = \sum (\hat{y}_i - \bar{y})^2 \tag{14.10}$$


From the preceding discussion, we should expect that SST, SSR, and SSE are related. 
Indeed, the relationship among these three sums of squares provides one of the most im- 
portant results in statistics. 


RELATIONSHIP AMONG SST, SSR, AND SSE

$$\text{SST} = \text{SSR} + \text{SSE} \tag{14.11}$$

where
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Note: SSR can be thought of as the explained portion of SST, and SSE can be
thought of as the unexplained portion of SST.


Equation (14.11) shows that the total sum of squares can be partitioned into two components,
the sum of squares due to regression and the sum of squares due to error. Hence, if
the values of any two of these sums of squares are known, the third sum of squares can be
computed easily. For instance, in the Armand’s Pizza Parlors example, we already know
that SSE = 1530 and SST = 15,730; therefore, solving for SSR in equation (14.11), we
find that the sum of squares due to regression is


$$\text{SSR} = \text{SST} - \text{SSE} = 15{,}730 - 1530 = 14{,}200$$
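As a cross-check on the subtraction above, SSR can also be computed directly from equation (14.10). The sketch below (Python, illustrative names) computes all three sums of squares from the Armand's data and confirms that they satisfy equation (14.11). The partition holds here because $\hat{y} = 60 + 5x$ is the least squares line; it would not generally hold for an arbitrary line.

```python
# Armand's Pizza Parlors data (Table 14.3).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]             # student population (1000s)
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]  # quarterly sales ($1000s)

y_bar = sum(y) / len(y)
y_hat = [60 + 5 * xi for xi in x]   # predictions from y-hat = 60 + 5x

sst = sum((yi - y_bar) ** 2 for yi in y)               # equation (14.9)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # equation (14.10)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # equation (14.8)

print(ssr)               # 14200.0
print(sst == ssr + sse)  # True
```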


Now let us see how the three sums of squares, SST, SSR, and SSE, can be used to provide
a measure of the goodness of fit for the estimated regression equation. The estimated
regression equation would provide a perfect fit if every value of the dependent variable $y_i$
happened to lie on the estimated regression line. In this case, $y_i - \hat{y}_i$ would be zero for each
observation, resulting in SSE = 0. Because SST = SSR + SSE, we see that for a perfect fit
SSR must equal SST, and the ratio (SSR/SST) must equal one. Poorer fits will result in larger
values for SSE. Solving for SSE in equation (14.11), we see that SSE = SST $-$ SSR. Hence,
the largest value for SSE (and hence the poorest fit) occurs when SSR = 0 and SSE = SST.

The ratio SSR/SST, which will take values between zero and one, is used to evaluate the
goodness of fit for the estimated regression equation. This ratio is called the coefficient of
determination and is denoted by $r^2$.


COEFFICIENT OF DETERMINATION

$$r^2 = \frac{\text{SSR}}{\text{SST}} \tag{14.12}$$


For the Armand’s Pizza Parlors example, the value of the coefficient of determination is

$$r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{14{,}200}{15{,}730} = .9027$$


When we express the coefficient of determination as a percentage, $r^2$ can be interpreted
as the percentage of the total sum of squares that can be explained by using the estimated
regression equation. For Armand’s Pizza Parlors, we can conclude that 90.27% of the total
sum of squares can be explained by using the estimated regression equation $\hat{y} = 60 + 5x$ to
predict quarterly sales. In other words, 90.27% of the variability in sales can be explained
by the linear relationship between the size of the student population and sales. We should
be pleased to find such a good fit for the estimated regression equation.
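The arithmetic behind the 90.27% figure is short enough to verify directly. This Python sketch (variable names are illustrative) computes $r^2$ from SSR and SST and also shows the equivalent form $1 - \text{SSE}/\text{SST}$, which follows from equation (14.11).

```python
# Sums of squares for Armand's Pizza Parlors, taken from the text.
ssr = 14200
sse = 1530
sst = 15730

r_squared = ssr / sst            # equation (14.12)
r_squared_alt = 1 - sse / sst    # same quantity, because SST = SSR + SSE

print(round(r_squared, 4))       # 0.9027
print(f"{r_squared:.2%}")        # 90.27%
```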


Using Excel to Compute the Coefficient of Determination 


In Section 14.2 we showed how Excel can be used to construct a scatter diagram, display the 
estimated regression line, and compute the estimated regression equation for the Armand’s 
Pizza Parlors data appearing in Table 14.1. We will now describe how to compute the coeffi- 
cient of determination using the scatter diagram in Figure 14.5. 


Step 1. Right-click on the trendline and select the Format Trendline option 
Step 2. When the Format Trendline dialog box appears: 
Scroll down and select Display R-squared value on chart 
Close the Format Trendline dialog box 


The worksheet displayed in Figure 14.7 shows the scatter diagram, the estimated regression 
line, and the estimated regression equation. 


FIGURE 14.7 Using Excel's Chart Tools to Compute the Coefficient of Determination for Armand's Pizza Parlors


Correlation Coefficient


In Chapter 3 we introduced the correlation coefficient as a descriptive measure of the
strength of linear association between two variables, x and y. Values of the correlation
coefficient are always between $-1$ and $+1$. A value of $+1$ indicates that the two variables x
and y are perfectly related in a positive linear sense. That is, all data points are on a straight
line that has a positive slope. A value of $-1$ indicates that x and y are perfectly related in a
negative linear sense, with all data points on a straight line that has a negative slope. Values
of the correlation coefficient close to zero indicate that x and y are not linearly related.

In Section 3.5 we presented the equation for computing the sample correlation coefficient.
If a regression analysis has already been performed and the coefficient of determination
$r^2$ computed, the sample correlation coefficient can be computed as follows.


SAMPLE CORRELATION COEFFICIENT

$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of determination}} = (\text{sign of } b_1)\sqrt{r^2} \tag{14.13}$$

where

$b_1$ = the slope of the estimated regression equation $\hat{y} = b_0 + b_1 x$


The sign for the sample correlation coefficient is positive if the estimated regression
equation has a positive slope ($b_1 > 0$) and negative if the estimated regression equation has a
negative slope ($b_1 < 0$).

For the Armand’s Pizza Parlors example, the value of the coefficient of determination
corresponding to the estimated regression equation $\hat{y} = 60 + 5x$ is .9027. Because the
slope of the estimated regression equation is positive, equation (14.13) shows that the sample
correlation coefficient is $+\sqrt{.9027} = +.9501$. With a sample correlation coefficient of
$r_{xy} = +.9501$, we would conclude that a strong positive linear association exists between
x and y.
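Equation (14.13) can be compared with a direct calculation of the sample correlation coefficient. In the Python sketch below, `sample_correlation` is an illustrative helper (not something defined in the text) that computes the usual Pearson form, $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$; the two routes should agree for the Armand's data.

```python
import math

# Armand's Pizza Parlors data (Table 14.3).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def sample_correlation(x, y):
    """Sample (Pearson) correlation coefficient computed from raw data."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Route 1: equation (14.13), using r^2 = SSR/SST and the sign of the slope b1 = 5.
r_squared = 14200 / 15730
b1 = 5
r_from_r_squared = math.copysign(math.sqrt(r_squared), b1)

# Route 2: directly from the data.
r_direct = sample_correlation(x, y)

print(round(r_from_r_squared, 4))  # 0.9501
print(round(r_direct, 4))          # 0.9501
```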

In the case of a linear relationship between two variables, both the coefficient of
determination and the sample correlation coefficient provide measures of the strength of the
relationship. The coefficient of determination provides a measure between zero and one,
whereas the sample correlation coefficient provides a measure between $-1$ and $+1$.
Although the sample correlation coefficient is restricted to a linear relationship between two
variables, the coefficient of determination can be used for nonlinear relationships and for
relationships that have two or more independent variables. Thus, the coefficient of
determination provides a wider range of applicability.


NOTES + COMMENTS

1. In developing the least squares estimated regression equation and computing the
coefficient of determination, we made no probabilistic assumptions about the error
term $\epsilon$, and no statistical tests for significance of the relationship between x and y
were conducted. Larger values of $r^2$ imply that the least squares line provides a better
fit to the data; that is, the observations are more closely grouped about the least
squares line. But, using only $r^2$, we can draw no conclusion about whether the
relationship between x and y is statistically significant. Such a conclusion must be
based on considerations that involve the sample size and the properties of the
appropriate sampling distributions of the least squares estimators.

2. As a practical matter, for typical data found in the social sciences, values of $r^2$ as
low as .25 are often considered useful. For data in the physical and life sciences, $r^2$
values of .60 or greater are often found; in fact, in some cases, $r^2$ values greater
than .90 can be found. In business applications, $r^2$ values vary greatly, depending on
the unique characteristics of each application.


Methods

15. The data from exercise 1 follow.


$y_i$   3   7   5   11   14


The estimated regression equation for these data is $\hat{y} = .20 + 2.60x$.
a. Compute SSE, SST, and SSR using equations (14.8), (14.9), and (14.10).
b. Compute the coefficient of determination $r^2$. Comment on the goodness of fit.
c. Compute the sample correlation coefficient.
16. The data from exercise 2 follow. 


The estimated regression equation for these data is ŷ = 68 — 3x. 
a. Compute SSE, SST, and SSR. 
b. Compute the coefficient of determination $r^2$. Comment on the goodness of fit.
c. Compute the sample correlation coefficient. 
17. The data from exercise 3 follow. 


The estimated regression equation for these data is ŷ = 7.6 + .9x. What percentage of 
the total sum of squares can be accounted for by the estimated regression equation? 
What is the value of the sample correlation coefficient? 


Applications 

18. Consumer Reports: Headphones. The following data show the brand, price ($),
and the overall score for six stereo headphones that were tested by Consumer Reports
(Consumer Reports website). The overall score is based on sound quality and
effectiveness of ambient noise reduction. Scores range from 0 (lowest) to 100 (highest). The
estimated regression equation for these data is $\hat{y} = 23.194 + .318x$, where x = price
($) and y = overall score.


Brand              Price ($)   Score
Bose               180         76
Skullcandy         150         71
Koss                95         61
Phillips/O'Neill    70         56
Denon               70         40
ING                 35         26


a. Compute SST, SSR, and SSE.
b. Compute the coefficient of determination $r^2$. Comment on the goodness of fit.
c. What is the value of the sample correlation coefficient?

19. Sales Experience and Sales Performance. In exercise 7 a sales manager collected the
following data on x = years of experience and y = annual sales ($1000s). The estimated
regression equation for these data is $\hat{y} = 80 + 4x$.

              Years of      Annual Sales
Salesperson   Experience    ($1000s)
 1             1             80
 2
 3
 4
 5             6            103
 6             8            111
 7            10            119
 8            10            123
 9            11            117
10            13            136

DATA file: Sales

a. Compute SST, SSR, and SSE.
b. Compute the coefficient of determination $r^2$. Comment on the goodness of fit.
c. What is the value of the sample correlation coefficient?

20. The Cost of Lighter Racing Bikes. Bicycling, the world’s leading cycling maga- 
zine, reviews hundreds of bicycles throughout the year. Their “Road-Race” category 
contains reviews of bikes used by riders primarily interested in racing. One of the most 
important factors in selecting a bike for racing is the weight of the bike. The follow- 
ing data show the weight (pounds) and price ($) for 10 racing bikes reviewed by the 
magazine (Bicycling website). 

Brand                            Weight   Price ($)
RELIES                           17.8     2100
PINARELLO Paris                  16.1     6250
ORBEA Orca GDR                   14.9     8370
EDDY MERCKX EMX-7                15.9     6200
BH RC1 Ultegra                   17.2     4000
BH Ultralight 386                (Sal     8600
CERVELO S5 Team                  16.2     6000
GIANT TCR Advanced 2             WA       2580
WILIER TRIESTINA Gran Turismo    17.6     3400
SPECIALIZED S-Works Amira SL4    14.1     8000

DATA file: RacingBicycles


a. Use the data to develop an estimated regression equation that could be used to
estimate the price for a bike given the weight.

b. Compute $r^2$. Did the estimated regression equation provide a good fit?
c. Predict the price for a bike that weighs 15 pounds. 

21. Cost Estimation. An important application of regression analysis in accounting is in 
the estimation of cost. By collecting data on volume and cost and using the least squares 
method to develop an estimated regression equation relating volume and cost, an accoun- 
tant can estimate the cost associated with a particular manufacturing volume. Consider the 
following sample of production volumes and total cost data for a manufacturing operation. 


Production Volume (units)   Total Cost ($)
400                         4000
450                         5000
550                         5400
600                         5900
700                         6400
750                         7000

a. Use these data to develop an estimated regression equation that could be used to
predict the total cost for a given production volume.

b. What is the variable cost per unit produced? 

c. Compute the coefficient of determination. What percentage of the variation in total 
cost can be explained by production volume? 

d. The company’s production schedule shows 500 units must be produced next month. 
Predict the total cost for this operation. 

22. Rental Car Revenue. The following data were used to investigate the relationship 
between the number of cars in service (1000s) and the annual revenue ($ millions) for 
six smaller car rental companies (Auto Rental News website). 


                                   Cars      Revenue
Company                            (1000s)   ($ millions)
U-Save Auto Rental System, Inc.    11.5      118
Payless Car Rental System, Inc.    10.0      135
ACE Rent A Car                      9.0      100
Rent-A-Wreck of America             5.5       37
Triangle Rent-A-Car                 4.2       40
Affordable/Sensible                 3.3       32


With x = cars in service (1000s) and y = annual revenue ($ millions), the estimated
regression equation is $\hat{y} = -17.005 + 12.966x$. For these data SSE = 1043.03.

a. Compute the coefficient of determination $r^2$.

b. Did the estimated regression equation provide a good fit? Explain. 

c. What is the value of the sample correlation coefficient? Does it reflect a strong 
or weak relationship between the number of cars in service and the annual 
revenue? 


14.4 Model Assumptions 


In conducting a regression analysis, we begin by making an assumption about the
appropriate model for the relationship between the dependent and independent variable(s). For the
case of simple linear regression, the assumed regression model is

$$y = \beta_0 + \beta_1 x + \epsilon$$

Then the least squares method is used to develop values for $b_0$ and $b_1$, the estimates
of the model parameters $\beta_0$ and $\beta_1$, respectively. The resulting estimated regression
equation is

$$\hat{y} = b_0 + b_1 x$$

We saw that the value of the coefficient of determination ($r^2$) is a measure of the goodness
of fit of the estimated regression equation. However, even with a large value of $r^2$, the
estimated regression equation should not be used until further analysis of the appropriateness
of the assumed model has been conducted. An important step in determining whether the
assumed model is appropriate involves testing for the significance of the relationship. The
tests of significance in regression analysis are based on the following assumptions about
the error term $\epsilon$.

ASSUMPTIONS ABOUT THE ERROR TERM $\epsilon$ IN THE REGRESSION MODEL

$$y = \beta_0 + \beta_1 x + \epsilon$$

1. The error term $\epsilon$ is a random variable with a mean or expected value of zero;
that is, $E(\epsilon) = 0$.
Implication: $\beta_0$ and $\beta_1$ are constants, therefore $E(\beta_0) = \beta_0$ and $E(\beta_1) = \beta_1$;
thus, for a given value of x, the expected value of y is

$$E(y) = \beta_0 + \beta_1 x \tag{14.14}$$

As we indicated previously, equation (14.14) is referred to as the regression
equation.

2. The variance of $\epsilon$, denoted by $\sigma^2$, is the same for all values of x.
Implication: The variance of y about the regression line equals $\sigma^2$ and is the
same for all values of x.

3. The values of $\epsilon$ are independent.
Implication: The value of $\epsilon$ for a particular value of x is not related to the value
of $\epsilon$ for any other value of x; thus, the value of y for a particular value of x is not
related to the value of y for any other value of x.

4. The error term $\epsilon$ is a normally distributed random variable for all values of x.
Implication: Because y is a linear function of $\epsilon$, y is also a normally distributed
random variable for all values of x.
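One way to build intuition for the four assumptions listed above is to simulate data that satisfy them. The Python sketch below is only an illustration: the parameter values `BETA0`, `BETA1`, and `SIGMA` are hypothetical choices, not estimates from the text. It draws independent, normally distributed errors with mean zero and the same standard deviation at every x, then summarizes the simulated y values at three x values.

```python
import random
import statistics

# Hypothetical model parameters, chosen only for illustration.
BETA0, BETA1, SIGMA = 60.0, 5.0, 14.0

rng = random.Random(0)  # fixed seed so the run is repeatable


def simulate_y(x, n_draws):
    """Draw y = beta0 + beta1*x + epsilon with epsilon ~ Normal(0, sigma)."""
    return [BETA0 + BETA1 * x + rng.gauss(0, SIGMA) for _ in range(n_draws)]


summary = {}
for x in (10, 20, 30):
    ys = simulate_y(x, 20000)
    # The mean of y should be near E(y) = beta0 + beta1*x (assumption 1), and
    # the spread should be near sigma at every x (assumption 2).
    summary[x] = (statistics.mean(ys), statistics.stdev(ys))
    print(x, round(summary[x][0], 1), round(summary[x][1], 1))
```

In a run of this sketch, the simulated means track the line $E(y) = \beta_0 + \beta_1 x$ while the standard deviations stay close to the single value of $\sigma$.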


Figure 14.8 illustrates the model assumptions and their implications; note that in this
graphical interpretation, the value of $E(y)$ changes according to the specific value of x
considered. However, regardless of the x value, the probability distribution of $\epsilon$ and hence
the probability distributions of y are normally distributed, each with the same variance. The
specific value of the error $\epsilon$ at any particular point depends on whether the actual value of y
is greater than or less than $E(y)$.


FIGURE 14.8 Assumptions for the Regression Model
[Figure: distributions of y at several values of x, including x = 20 and x = 30, centered on
the line $E(y) = \beta_0 + \beta_1 x$. Note: The y distributions have the same shape at each x value.]


At this point, we must keep in mind that we are also making an assumption or hypothesis
about the form of the relationship between x and y. That is, we assume that a straight
line represented by $\beta_0 + \beta_1 x$ is the basis for the relationship between the variables. We
must not lose sight of the fact that some other model, for instance $y = \beta_0 + \beta_1 x^2 + \epsilon$, may
turn out to be a better model for the underlying relationship.


14.5 Testing for Significance 


In a simple linear regression equation, the mean or expected value of y is a linear
function of x: $E(y) = \beta_0 + \beta_1 x$. If the value of $\beta_1$ is zero, $E(y) = \beta_0 + (0)x = \beta_0$. In this case,
the mean value of y does not depend on the value of x and hence we would conclude that
x and y are not linearly related. Alternatively, if the value of $\beta_1$ is not equal to zero, we
would conclude that the two variables are related. Thus, to test for a significant regression
relationship, we must conduct a hypothesis test to determine whether the value of $\beta_1$ is
zero. Two tests are commonly used. Both require an estimate of $\sigma^2$, the variance of $\epsilon$ in the
regression model.


Estimate of $\sigma^2$


From the regression model and its assumptions we can conclude that $\sigma^2$, the variance of $\epsilon$,
also represents the variance of the y values about the regression line. Recall that the
deviations of the y values about the estimated regression line are called residuals. Thus, SSE, the
sum of squared residuals, is a measure of the variability of the actual observations about
the estimated regression line. The mean square error (MSE) provides the estimate of $\sigma^2$;
it is SSE divided by its degrees of freedom.

With $\hat{y}_i = b_0 + b_1 x_i$, SSE can be written as

$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$$

Every sum of squares has associated with it a number called its degrees of freedom.
Statisticians have shown that SSE has $n - 2$ degrees of freedom because two parameters ($\beta_0$
and $\beta_1$) must be estimated to compute SSE. Thus, the mean square error is computed by
dividing SSE by $n - 2$. MSE provides an unbiased estimator of $\sigma^2$. Because the value of
MSE provides an estimate of $\sigma^2$, the notation $s^2$ is also used.


MEAN SQUARE ERROR (ESTIMATE OF $\sigma^2$)

$$s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2} \tag{14.15}$$


In Section 14.3 we showed that for the Armand’s Pizza Parlors example, SSE = 1530;
hence,

$$s^2 = \text{MSE} = \frac{1530}{8} = 191.25$$

provides an unbiased estimate of $\sigma^2$.

To estimate $\sigma$ we take the square root of $s^2$. The resulting value, s, is referred to as the
standard error of the estimate.

STANDARD ERROR OF THE ESTIMATE

$$s = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n - 2}} \tag{14.16}$$

For the Armand’s Pizza Parlors example, $s = \sqrt{\text{MSE}} = \sqrt{191.25} = 13.829$. In the
following discussion, we use the standard error of the estimate in the tests for a significant
relationship between x and y.
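Equations (14.15) and (14.16) translate directly into code. The Python sketch below uses illustrative variable names together with the SSE and sample size reported in the text.

```python
import math

sse = 1530   # sum of squares due to error, from Section 14.3
n = 10       # number of restaurants in the sample

# Two parameters are estimated, so SSE has n - 2 degrees of freedom.
mse = sse / (n - 2)   # equation (14.15), also written s^2
s = math.sqrt(mse)    # equation (14.16), the standard error of the estimate

print(mse)           # 191.25
print(round(s, 3))   # 13.829
```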


t Test 


The simple linear regression model is $y = \beta_0 + \beta_1 x + \epsilon$. If x and y are linearly related, we
must have $\beta_1 \neq 0$. The purpose of the t test is to see whether we can conclude that $\beta_1 \neq 0$.
We will use the sample data to test the following hypotheses about the parameter $\beta_1$.

$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

If $H_0$ is rejected, we will conclude that $\beta_1 \neq 0$ and that a statistically significant
relationship exists between the two variables. However, if $H_0$ cannot be rejected, we will have
insufficient evidence to conclude that a significant relationship exists. The properties of
the sampling distribution of $b_1$, the least squares estimator of $\beta_1$, provide the basis for the
hypothesis test.

First, let us consider what would happen if we used a different random sample for the
same regression study. For example, suppose that Armand’s Pizza Parlors used the sales
records of a different sample of 10 restaurants. A regression analysis of this new sample
might result in an estimated regression equation similar to our previous estimated
regression equation $\hat{y} = 60 + 5x$. However, it is doubtful that we would obtain exactly the same
equation (with an intercept of exactly 60 and a slope of exactly 5). Indeed, $b_0$ and $b_1$, the
least squares estimators, are sample statistics with their own sampling distributions. The
properties of the sampling distribution of $b_1$ follow.


SAMPLING DISTRIBUTION OF $b_1$

Expected Value

$$E(b_1) = \beta_1$$

Standard Deviation

$$\sigma_{b_1} = \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}} \tag{14.17}$$

Distribution Form
Normal

Note that the expected value of $b_1$ is equal to $\beta_1$, so $b_1$ is an unbiased estimator of $\beta_1$.
Because we do not know the value of $\sigma$, we develop an estimate of $\sigma_{b_1}$, denoted $s_{b_1}$, by
estimating $\sigma$ with s in equation (14.17). Thus, we obtain the following estimate of $\sigma_{b_1}$.


ESTIMATED STANDARD DEVIATION OF $b_1$

$$s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \tag{14.18}$$

Note: The standard deviation of $b_1$ is also referred to as the standard error of $b_1$. Thus,
$s_{b_1}$ provides an estimate of the standard error of $b_1$.

For Armand’s Pizza Parlors, s = 13.829. Hence, using $\sum (x_i - \bar{x})^2 = 568$ as shown in
Table 14.2, we have

$$s_{b_1} = \frac{13.829}{\sqrt{568}} = .5803$$

as the estimated standard deviation of $b_1$.

The t test for a significant relationship is based on the fact that the test statistic

$$\frac{b_1 - \beta_1}{s_{b_1}}$$

follows a t distribution with $n - 2$ degrees of freedom. If the null hypothesis is true, then
$\beta_1 = 0$ and $t = b_1/s_{b_1}$.
Let us conduct this test of significance for Armand’s Pizza Parlors at the $\alpha = .01$ level
of significance. The test statistic is

$$t = \frac{b_1}{s_{b_1}} = \frac{5}{.5803} = 8.62$$

The t distribution table (Table 2 of Appendix B) shows that with $n - 2 = 10 - 2 = 8$
degrees of freedom, t = 3.355 provides an area of .005 in the upper tail. Thus, the area
in the upper tail of the t distribution corresponding to the test statistic t = 8.62 must be
less than .005. Because this test is a two-tailed test, we double this value to conclude that
the p-value associated with t = 8.62 must be less than 2(.005) = .01. Using Excel, the
p-value = .000. Because the p-value is less than $\alpha = .01$, we reject $H_0$ and conclude that
$\beta_1$ is not equal to zero. This evidence is sufficient to conclude that a significant
relationship exists between student population and quarterly sales. A summary of the t test for
significance in simple linear regression follows.
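The t test calculation above can be sketched in Python using the critical value approach. The variable names are illustrative, and the critical value 3.355 is simply copied from the t table cited in the text rather than computed, so the sketch does not produce a p-value.

```python
import math

b1 = 5                       # slope of the estimated regression equation
s = math.sqrt(1530 / 8)      # standard error of the estimate, about 13.829
sxx = 568                    # sum of (x_i - x_bar)^2 from Table 14.2

s_b1 = s / math.sqrt(sxx)    # equation (14.18)
t_stat = b1 / s_b1           # equation (14.19)

# Upper-tail area .005 with 8 degrees of freedom, taken from the text's t table.
t_critical = 3.355
reject_h0 = abs(t_stat) >= t_critical   # two-tailed test at alpha = .01

print(round(s_b1, 4))    # 0.5803
print(round(t_stat, 2))  # 8.62
print(reject_h0)         # True
```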


t TEST FOR SIGNIFICANCE IN SIMPLE LINEAR REGRESSION

$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

TEST STATISTIC

$$t = \frac{b_1}{s_{b_1}} \tag{14.19}$$

REJECTION RULE
p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or if $t \ge t_{\alpha/2}$

where $t_{\alpha/2}$ is based on a t distribution with $n - 2$ degrees of freedom.


Confidence Interval for $\beta_1$


The form of a confidence interval for $\beta_1$ is as follows:

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

The point estimator is $b_1$ and the margin of error is $t_{\alpha/2}\, s_{b_1}$. The confidence coefficient
associated with this interval is $1 - \alpha$, and $t_{\alpha/2}$ is the t value providing an area of $\alpha/2$ in the
upper tail of a t distribution with $n - 2$ degrees of freedom. For example, suppose that we
wanted to develop a 99% confidence interval estimate of $\beta_1$ for Armand’s Pizza
Parlors. From Table 2 of Appendix B we find that the t value corresponding to $\alpha = .01$
and $n - 2 = 10 - 2 = 8$ degrees of freedom is $t_{.005} = 3.355$. Thus, the 99% confidence
interval estimate of $\beta_1$ is

$$b_1 \pm t_{\alpha/2}\, s_{b_1} = 5 \pm 3.355(.5803) = 5 \pm 1.95$$

or 3.05 to 6.95.

In using the t test for significance, the hypotheses tested were

$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

At the $\alpha = .01$ level of significance, we can use the 99% confidence interval as an
alternative for drawing the hypothesis testing conclusion for the Armand’s data. Because 0, the
hypothesized value of $\beta_1$, is not included in the confidence interval (3.05 to 6.95), we can
reject $H_0$ and conclude that a significant statistical relationship exists between the size of
the student population and quarterly sales. In general, a confidence interval can be used to
test any two-sided hypothesis about $\beta_1$. If the hypothesized value of $\beta_1$ is contained in the
confidence interval, do not reject $H_0$. Otherwise, reject $H_0$.
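The confidence interval and its use as a two-sided test can be sketched as follows. This is a Python illustration with assumed variable names; the t value 3.355 is taken from the table cited in the text.

```python
import math

b1 = 5                                    # point estimate of beta_1
s_b1 = math.sqrt(1530 / 8) / math.sqrt(568)   # estimated standard deviation of b1
t_value = 3.355                           # t_.005 with 8 degrees of freedom (from the text)

margin_of_error = t_value * s_b1
lower = b1 - margin_of_error
upper = b1 + margin_of_error

# Two-sided test of H0: beta_1 = 0 at alpha = .01 using the 99% interval:
# reject H0 when the hypothesized value lies outside the interval.
hypothesized_beta1 = 0
reject_h0 = not (lower <= hypothesized_beta1 <= upper)

print(round(margin_of_error, 2))          # 1.95
print(round(lower, 2), round(upper, 2))   # 3.05 6.95
print(reject_h0)                          # True
```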


F Test 


An F test, based on the F probability distribution, can also be used to test for
significance in regression. With only one independent variable, the F test will provide the
same conclusion as the t test; that is, if the t test indicates $\beta_1 \neq 0$ and hence a significant
relationship, the F test will also indicate a significant relationship. But with more than
one independent variable, only the F test can be used to test for an overall significant
relationship.

The logic behind the use of the F test for determining whether the regression
relationship is statistically significant is based on the development of two independent estimates of
$\sigma^2$. We explained how MSE provides an estimate of $\sigma^2$. If the null hypothesis $H_0$: $\beta_1 = 0$ is
true, the sum of squares due to regression, SSR, divided by its degrees of freedom provides
another independent estimate of $\sigma^2$. This estimate is called the mean square due to
regression, or simply the mean square regression, and is denoted MSR. In general,

$$\text{MSR} = \frac{\text{SSR}}{\text{Regression degrees of freedom}}$$

For the models we consider in this text, the regression degrees of freedom is always
equal to the number of independent variables in the model:

$$\text{MSR} = \frac{\text{SSR}}{\text{Number of independent variables}} \tag{14.20}$$

Because we consider only regression models with one independent variable in this
chapter, we have MSR = SSR/1 = SSR. Hence, for Armand’s Pizza Parlors, MSR =
SSR = 14,200.

If the null hypothesis ($H_0$: $\beta_1 = 0$) is true, MSR and MSE are two independent estimates
of $\sigma^2$ and the sampling distribution of MSR/MSE follows an F distribution with
numerator degrees of freedom equal to 1 and denominator degrees of freedom equal to $n - 2$.
Therefore, when $\beta_1 = 0$, the value of MSR/MSE should be close to 1. However, if the null
hypothesis is false ($\beta_1 \neq 0$), MSR will overestimate $\sigma^2$ and the value of MSR/MSE will be
inflated; thus, large values of MSR/MSE lead to the rejection of $H_0$ and the conclusion that
the relationship between x and y is statistically significant.

Let us conduct the F test for the Armand’s Pizza Parlors example. The test statistic is

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{14{,}200}{191.25} = 74.25$$

The F distribution table (Table 4 of Appendix B) shows that with 1 degree of freedom in
the numerator and $n - 2 = 10 - 2 = 8$ degrees of freedom in the denominator, F = 11.26
provides an area of .01 in the upper tail. Thus, the area in the upper tail of the F distribution
corresponding to the test statistic F = 74.25 must be less than .01. Thus, we conclude that
the p-value must be less than .01. Using Excel, the p-value = .000. Because the p-value is
less than $\alpha = .01$, we reject $H_0$ and conclude that a significant relationship exists between
the size of the student population and quarterly sales.

Note: The F test and the t test provide identical results for simple linear regression.

A summary of the F test for significance in simple linear regression follows.
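The F statistic for the Armand's example, and its link to the t statistic, can be sketched in Python as follows. The variable names are illustrative, and the critical value 11.26 is copied from the F table cited in the text. For simple linear regression the F statistic equals the square of the t statistic, which is one way to see why the two tests lead to the same conclusion.

```python
import math

ssr, sse = 14200, 1530
n = 10
num_independent_vars = 1

msr = ssr / num_independent_vars             # equation (14.20)
mse = sse / (n - num_independent_vars - 1)   # SSE / (n - 2) in simple regression
f_stat = msr / mse                           # equation (14.21)

# Upper-tail area .01 with 1 and 8 degrees of freedom, taken from the text's F table.
f_critical = 11.26
reject_h0 = f_stat >= f_critical

# t statistic from the t test: b1 / s_b1 with b1 = 5 and sum of (x_i - x_bar)^2 = 568.
t_stat = 5 / (math.sqrt(mse) / math.sqrt(568))

print(round(f_stat, 2))                  # 74.25
print(reject_h0)                         # True
print(abs(t_stat ** 2 - f_stat) < 1e-9)  # True
```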


F TEST FOR SIGNIFICANCE IN SIMPLE LINEAR REGRESSION

$H_0$: $\beta_1 = 0$
$H_a$: $\beta_1 \neq 0$

TEST STATISTIC

$$F = \frac{\text{MSR}}{\text{MSE}} \tag{14.21}$$

REJECTION RULE
p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $F \ge F_\alpha$

where $F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and
$n - 2$ degrees of freedom in the denominator.

Note: If $H_0$ is false, MSE still provides an unbiased estimate of $\sigma^2$ and MSR
overestimates $\sigma^2$. If $H_0$ is true, both MSE and MSR provide unbiased estimates of $\sigma^2$; in
this case the value of MSR/MSE should be close to 1.


In Chapter 13 we covered analysis of variance (ANOVA) and showed how an ANOVA
table could be used to provide a convenient summary of the computational aspects of
analysis of variance. A similar ANOVA table can be used to summarize the results of the
F test for significance in regression. Table 14.5 is the general form of the ANOVA table
for simple linear regression. Table 14.6 is the ANOVA table with the F test computations
performed for Armand’s Pizza Parlors. Regression, Error, and Total are the labels for the
three sources of variation, with SSR, SSE, and SST appearing as the corresponding sum
of squares in column 2. The degrees of freedom, 1 for SSR, $n - 2$ for SSE, and $n - 1$ for
SST, are shown in column 3. Column 4 contains the values of MSR and MSE, column 5
contains the value of F = MSR/MSE, and column 6 contains the p-value corresponding to
the F value in column 5. Almost all computer printouts of regression analysis include an
ANOVA table summary of the F test for significance.

Note: In every analysis of variance table, the total sum of squares is the sum of the
regression sum of squares and the error sum of squares; in addition, the total degrees of
freedom is the sum of the regression degrees of freedom and the error degrees of freedom.


TABLE 14.5 General Form of the ANOVA Table for Simple Linear Regression

Source         Sum          Degrees       Mean
of Variation   of Squares   of Freedom    Square                F                p-Value
Regression     SSR          1             MSR = SSR/1           F = MSR/MSE
Error          SSE          n - 2         MSE = SSE/(n - 2)
Total          SST          n - 1


TABLE 14.6 ANOVA Table for the Armand's Pizza Parlors Problem

Source         Sum          Degrees       Mean
of Variation   of Squares   of Freedom    Square                F                        p-Value
Regression     14,200       1             14,200/1 = 14,200     14,200/191.25 = 74.25    .000
Error          1530         8             1530/8 = 191.25
Total          15,730       9
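To show how the pieces of the ANOVA table fit together, the following Python sketch builds the table from the raw Armand's data. The function name `simple_regression_anova` and the dictionary layout are illustrative assumptions; the slope and intercept use the standard least squares formulas. The p-value column is omitted because computing it requires an F distribution function that the Python standard library does not provide.

```python
def simple_regression_anova(x, y):
    """Fit y-hat = b0 + b1*x by least squares and return the ANOVA quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    msr = ssr / 1          # one independent variable
    mse = sse / (n - 2)

    return {
        "Regression": {"SS": ssr, "df": 1, "MS": msr, "F": msr / mse},
        "Error": {"SS": sse, "df": n - 2, "MS": mse},
        "Total": {"SS": sst, "df": n - 1},
    }


# Armand's Pizza Parlors data (Table 14.3).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

table = simple_regression_anova(x, y)
print(f"{'Source':<12}{'SS':>10}{'df':>4}{'MS':>12}{'F':>8}")
for source, row in table.items():
    ms = f"{row['MS']:.2f}" if "MS" in row else ""
    f_value = f"{row['F']:.2f}" if "F" in row else ""
    print(f"{source:<12}{row['SS']:>10.0f}{row['df']:>4}{ms:>12}{f_value:>8}")
```

The printed rows should agree with Table 14.6: sums of squares of 14,200, 1530, and 15,730 with 1, 8, and 9 degrees of freedom.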


Some Cautions About the Interpretation 
of Significance Tests 


Rejecting the null hypothesis H₀: β₁ = 0 and concluding that the relationship between
x and y is significant do not enable us to conclude that a cause-and-effect relationship is
present between x and y. Concluding a cause-and-effect relationship is warranted only if
the analyst can provide some type of theoretical justification that the relationship is in fact
causal. In the Armand’s Pizza Parlors example, we can conclude that there is a significant
relationship between the size of the student population x and quarterly sales y; moreover,
the estimated regression equation ŷ = 60 + 5x provides the least squares estimate of the
relationship. We cannot, however, conclude that changes in student population x cause
changes in quarterly sales y just because we identified a statistically significant
relationship. The appropriateness of such a cause-and-effect conclusion is left to supporting
theoretical justification and to good judgment on the part of the analyst. Armand’s managers
felt that increases in the student population were a likely cause of increased quarterly sales.
Thus, the result of the significance test enabled them to conclude that a cause-and-effect
relationship was present.

Regression analysis, which can be used to identify how variables are associated with one
another, cannot be used as evidence of a cause-and-effect relationship.

In addition, just because we are able to reject H₀: β₁ = 0 and demonstrate statistical
significance does not enable us to conclude that the relationship between x and y is linear.
We can state only that x and y are related and that a linear relationship explains a
significant portion of the variability in y over the range of values for x observed in the sample.
Figure 14.9 illustrates this situation. The test for significance calls for the rejection of the
null hypothesis H₀: β₁ = 0 and leads to the conclusion that x and y are significantly related,
but the figure shows that the actual relationship between x and y is not linear. Although the
linear approximation provided by ŷ = b₀ + b₁x is good over the range of x values observed
in the sample, it becomes poor for x values outside that range.


FIGURE 14.9 Example of a Linear Approximation of a Nonlinear Relationship

[Plot of y against x. Labels: Actual relationship; Smallest x value; Largest x value; Range of x values observed.]

Given a significant relationship, we should feel confident in using the estimated
regression equation for predictions corresponding to x values within the range of the x values
observed in the sample. For Armand’s Pizza Parlors, this range corresponds to values of x
between 2 and 26. Unless other reasons indicate that the model is valid beyond this range,
predictions outside the range of the independent variable should be made with caution. For
Armand’s Pizza Parlors, because the regression relationship has been found significant at
the .01 level, we should feel confident using it to predict sales for restaurants where the
associated student population is between 2000 and 26,000.


NOTES + COMMENTS


1. The assumptions made about the error term (Section 14.4) are what allow the tests of
statistical significance in this section. The properties of the sampling distribution of b₁
and the subsequent t and F tests follow directly from these assumptions.

2. Do not confuse statistical significance with practical significance. With very large
sample sizes, statistically significant results can be obtained for small values of b₁; in such
cases, one must exercise care in concluding that the relationship has practical significance.
We discuss this in more detail in Section 14.10.

3. A test of significance for a linear relationship between x and y can also be performed by
using the sample correlation coefficient r_xy. With ρ_xy denoting the population correlation
coefficient, the hypotheses are as follows.

H₀: ρ_xy = 0
Hₐ: ρ_xy ≠ 0

A significant relationship can be concluded if H₀ is rejected. The details of this test are
provided in Appendix 14.2. However, the t and F tests presented previously in this section
give the same result as the test for significance using the correlation coefficient. Conducting
a test for significance using the correlation coefficient therefore is not necessary if a t or
F test has already been conducted.


EXERCISES 


Methods 
23. The data from exercise 1 follow.


xᵢ    1    2    3    4    5
yᵢ    3    7    5   11   14


a. Compute the mean square error using equation (14.15).
b. Compute the standard error of the estimate using equation (14.16).
c. Compute the estimated standard deviation of b₁ using equation (14.18).
d. Use the t test to test the following hypotheses (α = .05):

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

e. Use the F test to test the hypotheses in part (d) at a .05 level of significance. Present
the results in the analysis of variance table format.
24. The data from exercise 2 follow.


xᵢ    3   12    6   20   14
yᵢ   55   40   55   10   15


a. Compute the mean square error using equation (14.15).
b. Compute the standard error of the estimate using equation (14.16).
c. Compute the estimated standard deviation of b₁ using equation (14.18).
d. Use the t test to test the following hypotheses (α = .05):

H₀: β₁ = 0
Hₐ: β₁ ≠ 0

e. Use the F test to test the hypotheses in part (d) at a .05 level of significance. Present
the results in the analysis of variance table format.
25. The data from exercise 3 follow.


xᵢ    2    6    9   13   20
yᵢ    7   18    9   26   23


a. What is the value of the standard error of the estimate?
b. Test for a significant relationship by using the t test. Use α = .05.
c. Use the F test to test for a significant relationship. Use α = .05. What is your conclusion?


Applications 
26. Headphones. In exercise 18 the data on price ($) and the overall score for six stereo 
headphones tested by Consumer Reports were as follows (Consumer Reports website). 


Brand Price ($) Score 
Bose 180 76 
Skullcandy 150 71 
Koss 95 61 
Phillips/O’Neill 70 56 
Denon 70 40 
JVC 35 26 


a. Does the t test indicate a significant relationship between price and the overall
score? What is your conclusion? Use α = .05.

b. Test for a significant relationship using the F test. What is your conclusion? Use
α = .05.

c. Show the ANOVA table for these data. 

27. College GPA and Salary. Do students with higher college grade point averages (GPAs)
earn more than those graduates with lower GPAs (CivicScience)? Consider the college
GPA and salary data (10 years after graduation) provided in the file GPASalary.
DATA file: GPASalary

a. Develop a scatter diagram for these data with college GPA as the independent
variable. What does the scatter diagram indicate about the relationship between the
two variables?

b. Use these data to develop an estimated regression equation that can be used to
predict annual salary 10 years after graduation given college GPA.

c. At the .05 level of significance, does there appear to be a significant statistical
relationship between the two variables?

28. Broker Satisfaction Conclusion. In exercise 8, ratings data on x = the quality of the
speed of execution and y = overall satisfaction with electronic trades provided the
estimated regression equation ŷ = .2046 + .9077x (AAII website). At the .05 level of
significance, test whether speed of execution and overall satisfaction are related. Show
the ANOVA table. What is your conclusion?
DATA file: BrokerRatings

29. Cost Estimation Conclusion. Refer to exercise 21, where data on production volume
and cost were used to develop an estimated regression equation relating production
volume and cost for a particular manufacturing operation. Use α = .05 to test whether
the production volume is significantly related to the total cost. Show the ANOVA table.
What is your conclusion?

30. Rental Car Revenue and Fleet Size. Refer to exercise 9, where the following data were
used to investigate the relationship between the number of cars in service (1000s) and the
annual revenue ($ millions) for six smaller car rental companies (Auto Rental News
website).


                                     Cars       Revenue
Company                              (1000s)    ($ millions)
U-Save Auto Rental System, Inc.      11.5       118
Payless Car Rental System, Inc.      10.0       135
ACE Rent A Car                        9.0       100
Rent-A-Wreck of America               5.5        37
Triangle Rent-A-Car                   4.2        40
Affordable/Sensible                   3.3        32


With x = cars in service (1000s) and y = annual revenue ($ millions), the estimated
regression equation is ŷ = −17.005 + 12.966x. For these data SSE = 1043.03 and
SST = 10,568. Do these results indicate a significant relationship between the number
of cars in service and the annual revenue?


31. Racing Bike Significance of Weight on Cost. In exercise 20, data on x = weight
(pounds) and y = price ($) for 10 road-racing bikes provided the estimated regression
equation ŷ = 28,574 − 1439x (Bicycling website). For these data SSE = 7,102,922.54
and SST = 52,120,800. Use the F test to determine whether the weight for a bike and
the price are related at the .05 level of significance.
DATA file: RacingBicycles


14.6 Using the Estimated Regression Equation 
for Estimation and Prediction 


When using the simple linear regression model, we are making an assumption about the 
relationship between x and y. We then use the least squares method to obtain the estimated 
simple linear regression equation. If a significant relationship exists between x and y and 
the coefficient of determination shows that the fit is good, the estimated regression equa- 
tion should be useful for estimation and prediction. 

For the Armand’s Pizza Parlors example, the estimated regression equation is ŷ = 60 + 5x.
At the end of Section 14.1, we stated that ŷ can be used as a point estimator of E(y), the mean
or expected value of y for a given value of x, and as a predictor of an individual value of y. For
example, suppose Armand’s managers want to estimate the mean quarterly sales for all
restaurants located near college campuses with 10,000 students. Using the estimated regression
equation ŷ = 60 + 5x, we see that for x = 10 (10,000 students), ŷ = 60 + 5(10) = 110. Thus,
a point estimate of the mean quarterly sales for all restaurant locations near campuses with
10,000 students is $110,000. In this case we are using ŷ as the point estimator of the mean
value of y when x = 10.

We can also use the estimated regression equation to predict an individual value of y for
a given value of x. For example, to predict quarterly sales for a new restaurant Armand’s is
considering building near Talbot College, a campus with 10,000 students, we would
compute ŷ = 60 + 5(10) = 110. Hence, we would predict quarterly sales of $110,000 for such
a new restaurant. In this case, we are using ŷ as the predictor of y for a new observation
when x = 10.

When we are using the estimated regression equation to estimate the mean value of y or 
to predict an individual value of y, it is clear that the estimate or prediction depends on 
the given value of x. For this reason, as we discuss in more depth the issues concerning 
estimation and prediction, the following notation will help clarify matters. 


x* = the given value of the independent variable x

y* = the random variable denoting the possible values of the dependent
variable y when x = x*

E(y*) = the mean or expected value of the dependent variable y when x = x*

ŷ* = b₀ + b₁x* = the point estimator of E(y*) and the predictor of an
individual value of y* when x = x*


To illustrate the use of this notation, suppose we want to estimate the mean value of
quarterly sales for all Armand’s restaurants located near a campus with 10,000 students. For this
case, x* = 10 and E(y*) denotes the unknown mean value of quarterly sales for all
restaurants where x* = 10. Thus, the point estimate of E(y*) is provided by ŷ* = 60 + 5(10) = 110,
or $110,000. But, using this notation, ŷ* = 110 is also the predictor of quarterly sales for
the new restaurant located near Talbot College, a school with 10,000 students.


Interval Estimation 


Point estimators and predictors do not provide any information about the precision
associated with the estimate and/or prediction. For that we must develop confidence
intervals and prediction intervals. A confidence interval is an interval estimate of
the mean value of y for a given value of x. A prediction interval is used whenever
we want to predict an individual value of y for a new observation corresponding to a
given value of x. Although the predictor of y for a given value of x is the same as the
point estimator of the mean value of y for a given value of x, the interval estimates we
obtain for the two cases are different. As we will show, the margin of error is larger for
a prediction interval. We begin by showing how to develop an interval estimate of the
mean value of y.

Confidence intervals and prediction intervals show the precision of the regression
results. Narrower intervals provide a higher degree of precision.


Confidence Interval for the Mean Value of y 


In general, we cannot expect ŷ* to equal E(y*) exactly. If we want to make an inference
about how close ŷ* is to the true mean value E(y*), we will have to estimate the variance of
ŷ*. The formula for estimating the variance of ŷ*, denoted by s²_{ŷ*}, is

\[
s^2_{\hat{y}^*} = s^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right] \qquad (14.22)
\]

The estimate of the standard deviation of ŷ* is given by the square root of equation (14.22).

\[
s_{\hat{y}^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \qquad (14.23)
\]

The computational results for Armand’s Pizza Parlors in Section 14.5 provided s = 13.829.
With x* = 10, x̄ = 14, and Σ(xᵢ − x̄)² = 568, we can use equation (14.23) to obtain

\[
s_{\hat{y}^*} = 13.829\sqrt{\frac{1}{10} + \frac{(10 - 14)^2}{568}} = 13.829\sqrt{.1282} = 4.95
\]

The general expression for a confidence interval follows.


CONFIDENCE INTERVAL FOR E(y*)

\[
\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*} \qquad (14.24)
\]

where the confidence coefficient is 1 − α and t_{α/2} is based on the t distribution with n − 2
degrees of freedom.

The margin of error associated with this confidence interval is t_{α/2} s_{ŷ*}.


Using expression (14.24) to develop a 95% confidence interval of the mean quarterly
sales for all Armand’s restaurants located near campuses with 10,000 students, we need the
value of t for α/2 = .025 and n − 2 = 10 − 2 = 8 degrees of freedom. Using Table 2 of
Appendix B, we have t_{.025} = 2.306. Thus, with ŷ* = 110 and a margin of error of
t_{.025} s_{ŷ*} = 2.306(4.95) = 11.415, the 95% confidence interval estimate is


110 ± 11.415


In dollars, the 95% confidence interval for the mean quarterly sales of all restaurants near
campuses with 10,000 students is $110,000 ± $11,415. Therefore, the 95% confidence
interval for the mean quarterly sales when the student population is 10,000 is $98,585 to
$121,415.
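
The confidence interval calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the text's Excel procedure: the function name mean_confidence_interval is hypothetical, and the inputs (b₀ = 60, b₁ = 5, s = 13.829, n = 10, x̄ = 14, Σ(xᵢ − x̄)² = 568, t_{.025} = 2.306) are the values reported in the text.

```python
import math


def mean_confidence_interval(x_star, b0, b1, s, n, x_bar, sxx, t_crit):
    """Point estimate, standard error, and interval for E(y*), per (14.23)-(14.24)."""
    y_hat = b0 + b1 * x_star                                       # point estimate of E(y*)
    s_yhat = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)    # equation (14.23)
    margin = t_crit * s_yhat                                       # margin of error
    return y_hat, s_yhat, (y_hat - margin, y_hat + margin)


# Armand's Pizza Parlors, x* = 10 (10,000 students); t value from the t table, 8 df.
y_hat, s_yhat, (lower, upper) = mean_confidence_interval(
    x_star=10, b0=60, b1=5, s=13.829, n=10, x_bar=14, sxx=568, t_crit=2.306
)
print(f"point estimate = {y_hat:.1f}")
print(f"standard error = {s_yhat:.2f}")
print(f"95% interval   = {lower:.2f} to {upper:.2f}")
```

Because the sketch does not round the standard error to 4.95 before multiplying, its limits may differ from the text's 98.585 and 121.415 in the third decimal place.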

Note that the estimated standard deviation of ŷ* given by equation (14.23) is smallest
when x* − x̄ = 0. In this case the estimated standard deviation of ŷ* becomes

\[
s_{\hat{y}^*} = s\sqrt{\frac{1}{n} + \frac{(\bar{x} - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = s\sqrt{\frac{1}{n}}
\]

This result implies that we can make the best or most precise estimate of the mean value of
y whenever x* = x̄. In fact, the further x* is from x̄, the larger x* − x̄ becomes. As a result,
the confidence interval for the mean value of y will become wider as x* deviates more
from x̄. This pattern is shown graphically in Figure 14.10.
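
The widening of the interval away from x̄ can be checked numerically. The sketch below is illustrative and assumes the Armand's values reported in the text (s = 13.829, n = 10, x̄ = 14, Σ(xᵢ − x̄)² = 568); the function name and the grid of x* values are hypothetical choices made for this example.

```python
import math

# Values reported in the text for Armand's Pizza Parlors.
s, n, x_bar, sxx = 13.829, 10, 14, 568


def std_error_of_mean_estimate(x_star):
    """Estimated standard deviation of y-hat* from equation (14.23)."""
    return s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)


# Evaluate on a symmetric grid around x-bar = 14 within the observed range 2 to 26.
x_values = [2, 6, 10, 14, 18, 22, 26]
errors = {x_star: std_error_of_mean_estimate(x_star) for x_star in x_values}

for x_star, err in errors.items():
    print(f"x* = {x_star:>2}  standard error = {err:.3f}")
```

The smallest standard error occurs at x* = 14, where it reduces to s√(1/n), and the values grow symmetrically as x* moves toward either end of the observed range.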


Prediction Interval for an Individual Value of y 


Instead of estimating the mean value of quarterly sales for all Armand’s restaurants
located near campuses with 10,000 students, suppose we want to predict quarterly sales
for a new restaurant Armand’s is considering building near Talbot College, a campus
with 10,000 students. As noted previously, the predictor of y*, the value of y
corresponding to the given x*, is ŷ* = b₀ + b₁x*. For the new restaurant located near Talbot
College, x* = 10 and the prediction of quarterly sales is ŷ* = 60 + 5(10) = 110, or
$110,000. Note that the prediction of quarterly sales for the new Armand’s restaurant
near Talbot College is the same as the point estimate of the mean sales for all Armand’s
restaurants located near campuses with 10,000 students.


FIGURE 14.10 Confidence Intervals for the Mean Sales y at Given Values of Student Population x

[Plot of sales y against student population x showing the upper and lower confidence interval limits.]

To develop a prediction interval, let us first determine the variance associated with using
ŷ* as a predictor of y when x = x*. This variance is made up of the sum of the following
two components.


1. The variance of the y* values about the mean E(y*), an estimate of which is given
by s²

2. The variance associated with using ŷ* to estimate E(y*), an estimate of which is
given by s²_{ŷ*}


The formula for estimating the variance corresponding to the prediction of the value of y
when x = x*, denoted s²_pred, is

\[
s^2_{\text{pred}} = s^2 + s^2_{\hat{y}^*}
= s^2 + s^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]
= s^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right] \qquad (14.25)
\]

Hence, an estimate of the standard deviation corresponding to the prediction of the value of
y* is

\[
s_{\text{pred}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \qquad (14.26)
\]

For Armand’s Pizza Parlors, the estimated standard deviation corresponding to the
prediction of quarterly sales for a new restaurant located near Talbot College, a campus with
10,000 students, is computed as follows.

\[
s_{\text{pred}} = 13.829\sqrt{1 + \frac{1}{10} + \frac{(10 - 14)^2}{568}} = 13.829\sqrt{1.1282} = 14.69
\]

The general expression for a prediction interval follows.


PREDICTION INTERVAL FOR y*

\[
\hat{y}^* \pm t_{\alpha/2}\, s_{\text{pred}} \qquad (14.27)
\]

where the confidence coefficient is 1 − α and t_{α/2} is based on a t distribution with n − 2
degrees of freedom.

The margin of error associated with this prediction interval is t_{α/2} s_pred.


The 95% prediction interval for quarterly sales for the new Armand’s restaurant located near
Talbot College can be found using t_{α/2} = t_{.025} = 2.306 and s_pred = 14.69. Thus, with ŷ* =
110 and a margin of error of t_{.025} s_pred = 2.306(14.69) = 33.875, the 95% prediction interval is


110 ± 33.875


In dollars, this prediction interval is $110,000 ± $33,875 or $76,125 to $143,875. Note
that the prediction interval for the new restaurant located near Talbot College, a campus
with 10,000 students, is wider than the confidence interval for the mean quarterly sales
of all restaurants located near campuses with 10,000 students. The difference reflects the
fact that we are able to estimate the mean value of y more precisely than we can predict an
individual value of y.

Confidence intervals and prediction intervals are both more precise when the value of
the independent variable x* is closer to x̄. The general shapes of confidence intervals and
the wider prediction intervals are shown together in Figure 14.11.


FIGURE 14.11 Confidence and Prediction Intervals for Sales y at Given Values of Student Population x

In general, the lines for the confidence interval limits and the prediction interval
limits both have curvature.
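
The prediction interval for the Talbot College restaurant can be sketched the same way as the confidence interval for the mean. The Python below is an illustrative sketch under the values reported in the text (b₀ = 60, b₁ = 5, s = 13.829, n = 10, x̄ = 14, Σ(xᵢ − x̄)² = 568, t_{.025} = 2.306); the function name prediction_interval is hypothetical.

```python
import math


def prediction_interval(x_star, b0, b1, s, n, x_bar, sxx, t_crit):
    """Predicted value, s_pred, and interval for an individual y*, per (14.26)-(14.27)."""
    y_hat = b0 + b1 * x_star
    # The extra "1 +" term is the variance of an individual y* about its mean E(y*).
    s_pred = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)   # equation (14.26)
    margin = t_crit * s_pred
    return y_hat, s_pred, (y_hat - margin, y_hat + margin)


y_hat, s_pred, (lower, upper) = prediction_interval(
    x_star=10, b0=60, b1=5, s=13.829, n=10, x_bar=14, sxx=568, t_crit=2.306
)
print(f"prediction     = {y_hat:.1f}")
print(f"s_pred         = {s_pred:.2f}")
print(f"95% interval   = {lower:.2f} to {upper:.2f}")
```

As in the text, the interval comes out wider than the confidence interval for the mean at the same x*, because s_pred includes the additional variance of an individual observation. Small differences from 76.125 and 143.875 in the third decimal place arise because the sketch does not round s_pred to 14.69 first.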


NOTE + COMMENT 


A prediction interval is used to predict the value of the dependent variable y for a new
observation. As an illustration, we showed how to develop a prediction interval of quarterly
sales for a new restaurant that Armand’s is considering building near Talbot College, a
campus with 10,000 students. The fact that the value of x = 10 is not one of the values of
student population for the Armand’s sample data in Table 14.1 is not meant to imply that
prediction intervals cannot be developed for values of x in the sample data. But, for the
10 restaurants that make up the data in Table 14.1, developing a prediction interval for
quarterly sales for one of these restaurants does not make any sense because we already
know the value of quarterly sales for each of these restaurants. In other words, a prediction
interval only has meaning for something new, in this case, a new observation corresponding
to a particular value of x that may or may not equal one of the values of x in the sample.


EXERCISES


Methods 
32. The data from exercise 1 follow.


xᵢ    1    2    3    4    5
yᵢ    3    7    5   11   14


a. Use equation (14.23) to estimate the standard deviation of ŷ* when x = 4.
b. Use expression (14.24) to develop a 95% confidence interval for the expected value
of y when x = 4.
c. Use equation (14.26) to estimate the standard deviation of an individual value of y
when x = 4.
d. Use expression (14.27) to develop a 95% prediction interval for y when x = 4.
33. The data from exercise 2 follow.


xᵢ    3   12    6   20   14
yᵢ   55   40   55   10   15


a. Estimate the standard deviation of ŷ* when x = 8.
b. Develop a 95% confidence interval for the expected value of y when x = 8.
c. Estimate the standard deviation of an individual value of y when x = 8.
d. Develop a 95% prediction interval for y when x = 8.
34. The data from exercise 3 follow.


xᵢ    2    6    9   13   20
yᵢ    7   18    9   26   23


Develop the 95% confidence and prediction intervals when x = 12. Explain why these
two intervals are different.


Applications 
35. Restaurant Lines. Many small restaurants in Portland, Oregon, and other cities across
the United States do not take reservations. Owners say that with smaller capacity,
no-shows are costly, and they would rather have their staff focus on customer service
than maintain a reservation system (pressherald.com). However, it is important
to be able to give reasonable estimates of waiting time when customers arrive and
put their name on the waiting list. The file RestaurantLine contains 40 observations
of number of people in line ahead of a customer (independent variable x) and actual
waiting time (dependent variable y). The estimated regression equation is
ŷ = 4.35 + 8.81x and MSE = 94.42.
DATA file: RestaurantLine
a. Develop a point estimate for a customer who arrives with three people on the
wait-list.
b. Develop a 95% confidence interval for the mean waiting time for a customer who
arrives with three customers already in line.
c. Develop a 95% prediction interval for Roger and Sherry Davy’s waiting time if
there are three customers in line when they arrive.
d. Discuss the difference between parts (b) and (c).
36. Sales Performance. In exercise 7, the data on y = annual sales ($1000s) for new
customer accounts and x = number of years of experience for a sample of 10 salespersons
provided the estimated regression equation ŷ = 80 + 4x. For these data x̄ = 7,
Σ(xᵢ − x̄)² = 142, and s = 4.6098.
DATA file: Sales


a. Develop a 95% confidence interval for the mean annual sales for all salespersons
with nine years of experience.



b. The company is considering hiring Tom Smart, a salesperson with nine years of 
experience. Develop a 95% prediction interval of annual sales for Tom Smart. 

c. Discuss the differences in your answers to parts (a) and (b). 

37. Auditing Itemized Deductions. In exercise 13, data were given on the adjusted gross
income x and the amount of itemized deductions taken by taxpayers. Data were
reported in thousands of dollars. With the estimated regression equation ŷ = 4.68 + .16x,
the point estimate of a reasonable level of total itemized deductions for a taxpayer with
an adjusted gross income of $52,500 is $13,080.

a. Develop a 95% confidence interval for the mean amount of total itemized 
deductions for all taxpayers with an adjusted gross income of $52,500. 

b. Develop a 95% prediction interval estimate for the amount of total itemized 
deductions for a particular taxpayer with an adjusted gross income of $52,500. 

c. If the particular taxpayer referred to in part (b) claimed total itemized deductions 
of $20,400, would the IRS agent’s request for an audit appear to be justified? 

d. Use your answer to part (b) to give the IRS agent a guideline as to the amount of to- 
tal itemized deductions a taxpayer with an adjusted gross income of $52,500 should 
claim before an audit is recommended. 

38. Prediction Intervals for Cost Estimation. Refer to exercise 21, where data on the
production volume x and total cost y for a particular manufacturing operation were
used to develop the estimated regression equation ŷ = 1246.67 + 7.6x.

a. The company’s production schedule shows that 500 units must be produced next 
month. Predict the total cost for next month. 

b. Develop a 99% prediction interval for the total cost for next month. 

c. If an accounting cost report at the end of next month shows that the actual pro- 
duction cost during the month was $6000, should managers be concerned about 
incurring such a high total cost for the month? Discuss. 

39. Entertainment Spend. The Wall Street Journal asked Concur Technologies, Inc.,
an expense management company, to examine data from 8.3 million expense
reports to provide insights regarding business travel expenses. Their analysis of
the data showed that New York was the most expensive city. The following table
shows the average daily hotel room rate (x) and the average amount spent on
entertainment (y) for a random sample of 9 of the 25 most-visited U.S. cities. These
data lead to the estimated regression equation ŷ = 17.49 + 1.0334x. For these data
SSE = 1541.4.
DATA file: BusinessTravel


                   Room Rate    Entertainment
City               ($)          ($)
Boston             148          161
Denver              96          105
Nashville           91          101
New Orleans        110          142
Phoenix             90          100
San Diego          102          120
San Francisco      136          167
San Jose            90          140
Tampa               82           98


a. Predict the amount spent on entertainment for a particular city that has a daily room
rate of $89.

b. Develop a 95% confidence interval for the mean amount spent on entertainment for
all cities that have a daily room rate of $89.

c. The average room rate in Chicago is $128. Develop a 95% prediction interval for
the amount spent on entertainment in Chicago.


14.7 Excel’s Regression Tool


In previous sections of this chapter, we have shown how Excel’s chart tools can be used 
for various tasks in a regression analysis. Excel also has a more comprehensive Regression 
tool. In this section we will illustrate how Excel’s Regression tool can be used to perform 
a complete regression analysis, including statistical tests of significance for the Armand’s 
Pizza Parlors data in Table 14.2. 


Using Excel’s Regression Tool for the Armand’s Pizza
Parlors Example


Refer to Figures 14.12 and 14.13 as we describe the tasks involved to use Excel’s Regres- 
sion tool to perform the regression analysis computations for the Armand’s data. 


Enter/Access Data: Open the file Armand’s. The data are in cells B2:C11 and labels are 
in Column A and cells B1:C1. 


Apply Tools: The following steps describe how to use Excel’s Regression tool to perform 
the regression analysis computations performed in Sections 14.2-14.5. 


Step 1. Click the Data tab on the Ribbon
Step 2. In the Analyze group, click Data Analysis
Step 3. Choose Regression from the list of Analysis Tools
Step 4. When the Regression dialog box appears (see Figure 14.12):
            Enter C1:C11 in the Input Y Range: box
            Enter B1:B11 in the Input X Range: box
            Select the check box for Labels
            Select the check box for Confidence Level:
            Enter 99 in the Confidence Level: box
            Select Output Range: in the Output options area
            Enter A13 in the Output Range: box (to identify the upper left corner of
                the section of the worksheet where the output will appear)
            Click OK


FIGURE 14.12 Regression Tool Dialog Box for the Armand’s Pizza Parlors Example

[Worksheet with the Armand’s data in cells A1:C11 and the Regression dialog box. Dialog
settings shown: Input Y Range: $C$1:$C$11; Input X Range: $B$1:$B$11; Labels selected;
Confidence Level: 99%; Output Range: $A$13.]

Restaurant    Population    Sales
 1             2             58
 2             6            105
 3             8             88
 4             8            118
 5            12            117
 6            16            137
 7            20            157
 8            20            169
 9            22            149
10            26            202


FIGURE 14.13 Regression Tool Output for Armand’s Pizza Parlors

[Worksheet with the same data in rows 1 through 11; the regression output begins in row 13.]


The regression output, titled SUMMARY OUTPUT, begins with row 13 in Figure 14.13.
Because Excel initially displays the output using standard column widths, many of the
row and column labels are unreadable. In several places we have reformatted to improve
readability. We have also reformatted cells displaying numerical values to a maximum
of four decimal places. Numbers displayed using scientific notation have not been
modified. Regression output in future figures will be similarly reformatted to improve
readability.

The Excel output can be reformatted to improve readability.
The first section of the summary output, entitled Regression Statistics, contains sum- 
mary statistics such as the coefficient of determination (R Square). The second section 
of the output, titled ANOVA, contains the analysis of variance table. The last section of 
the output, which is not titled, contains the estimated regression coefficients and related 
information. Let us begin our interpretation of the regression output with the information 
contained in rows 29 and 30. 


Interpretation of Estimated Regression Equation Output 


Row 29 contains information about the y-intercept of the estimated regression line. Row
30 contains information about the slope of the estimated regression line. The y-intercept of
the estimated regression line, b₀ = 60, is shown in cell B29, and the slope of the estimated
regression line, b₁ = 5, is shown in cell B30. The label Intercept in cell A29 and the label
Population in cell A30 are used to identify these two values.

In Section 14.5 we showed that the estimated standard deviation of b₁ is s_{b₁} = .5803.
Cell C30 contains the estimated standard deviation of b₁. As we indicated previously, the
standard deviation of b₁ is also referred to as the standard error of b₁. Thus, s_{b₁} provides an
estimate of the standard error of b₁. The label Standard Error in cell C28 is Excel’s way
of indicating that the value in cell C30 is the estimate of the standard error, or standard
deviation, of b₁.

In Section 14.5 we stated that the null and alternative hypotheses needed to
test for a significant relationship between population and sales are as follows:


H₀: β₁ = 0
Hₐ: β₁ ≠ 0


Recall that the t test for a significant relationship required the computation of the t statistic,
t = b₁/s_{b₁}. For the Armand’s data, the value of t that we computed was t = 5/.5803 = 8.62.
Note that after rounding, the value in cell D30 is 8.62. The label in cell D28, t Stat, reminds
us that cell D30 contains the value of the t test statistic.


t Test The information in cell E30 provides a means for conducting a test of significance.
The value in cell E30 is the p-value associated with the t test for significance. Excel has
displayed the p-value using scientific notation. To obtain the decimal equivalent, we move
the decimal point 5 places to the left; we obtain a p-value of .0000255. Thus, the p-value
associated with the t test for significance is .0000255. Given the level of significance \(\alpha\),
the decision of whether to reject \(H_0\) can be made as follows:


Reject \(H_0\) if p-value \(\leq \alpha\)


Suppose the level of significance is \(\alpha = .01\). Because the p-value = .0000255 < \(\alpha\) = .01,
we can reject \(H_0\) and conclude that we have a significant relationship between student
population and sales. Because p-values are provided as part of the computer output for
regression analysis, the p-value approach is most often used for hypothesis tests in
regression analysis.

The information in cells F28:I30 can be used to develop confidence interval estimates
of the y-intercept and slope of the estimated regression equation. Excel always provides
the lower and upper limits for a 95% confidence interval. Recall that in the Regression
dialog box (see Figure 14.12) we selected Confidence Level and entered 99 in the
Confidence Level box. As a result, Excel's Regression tool also provides the lower and upper
limits for a 99% confidence interval. For instance, the value in cell H30 is the lower limit
for the 99% confidence interval estimate of \(\beta_1\) and the value in cell I30 is the upper limit.
Thus, after rounding, the 99% confidence interval estimate of \(\beta_1\) is 3.05 to 6.95. The
values in cells F30 and G30 provide the lower and upper limits for the 95% confidence
interval. Thus, the 95% confidence interval is 3.66 to 6.34.
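
Readers who want to check the values in rows 29 and 30 outside Excel can do so with a few lines of code. The Python sketch below is an illustration only and is not part of Excel's Regression tool. It recomputes the slope, its standard error, the t statistic, and the two confidence intervals from the Armand's data in Table 14.7. The variable names are our own, and the critical values 2.306 and 3.355 are assumed to have been read from a t table with 8 degrees of freedom.

```python
import math

# Armand's Pizza Parlors data as listed in Table 14.7
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]              # student population
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]  # sales
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates (cells B30 and B29)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate and of the slope (cell C30)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))
s_b1 = s / math.sqrt(sxx)

# t statistic for H0: beta_1 = 0 (cell D30)
t_stat = b1 / s_b1

# Assumed inputs: two-tailed critical values from a t table, 8 degrees of freedom
t_crit = {95: 2.306, 99: 3.355}
intervals = {level: (b1 - t * s_b1, b1 + t * s_b1) for level, t in t_crit.items()}

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, s_b1 = {s_b1:.4f}, t = {t_stat:.2f}")
for level, (lower, upper) in intervals.items():
    print(f"{level}% interval for the slope: {lower:.2f} to {upper:.2f}")
```

The sketch does not compute the p-value in cell E30, because that requires the t distribution itself rather than a single table entry.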


Interpretation of ANOVA Output 


The information in cells A22:F26 summarizes the analysis of variance computations for
the Armand's data. The three sources of variation are labeled Regression, Residual, and
Total. The label df in cell B23 stands for degrees of freedom, the label SS in cell C23
stands for sum of squares, and the label MS in cell D23 stands for mean square. Looking
at cells C24:C26, we see that the regression sum of squares is 14200, the residual or
error sum of squares is 1530, and the total sum of squares is 15730. The values in cells
B24:B26 are the degrees of freedom corresponding to each sum of squares. Thus, the
regression sum of squares has 1 degree of freedom, the residual or error sum of squares
has 8 degrees of freedom, and the total sum of squares has 9 degrees of freedom. As we
discussed previously, the regression degrees of freedom plus the residual degrees of
freedom equal the total degrees of freedom, and the regression sum of squares plus the
residual sum of squares equals the total sum of squares.

Excel refers to the error sum of squares as the residual sum of squares.

In Section 14.5 we stated that the mean square error, obtained by dividing the error
or residual sum of squares by its degrees of freedom, provides an estimate of \(\sigma^2\). The
value in cell D25, 191.25, is the mean square error for the Armand's regression output.
We also stated that the mean square regression is the sum of squares due to regression
divided by the regression degrees of freedom. The value in cell D24, 14200, is the mean
square regression.


F Test In Section 14.5 we showed that an F test, based upon the F probability
distribution, could also be used to test for significance in regression. The value in cell F24,
.0000255, is the p-value associated with the F test for significance. Suppose the level
of significance is \(\alpha = .01\). Because the p-value = .0000255 < \(\alpha\) = .01, we can reject
\(H_0\) and conclude that we have a significant relationship between student population and
sales. Note that it is the same conclusion that we obtained using the p-value approach
for the t test for significance. In fact, because the t test for significance is equivalent
to the F test for significance in simple linear regression, the p-values provided by both
approaches are identical. The label Excel uses to identify the p-value for the F test for
significance, shown in cell F23, is Significance F. In Chapter 9 we also stated that the
p-value is often referred to as the observed level of significance. Thus, the label
Significance F may be more meaningful if you think of the value in cell F24 as the observed
level of significance for the F test.
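
The entries in the ANOVA section can be rebuilt from the data with the sums of squares defined earlier in the chapter. The following Python sketch is offered as an illustration under that assumption. It uses the Armand's data from Table 14.7 with variable names of our own choosing, and it also checks numerically that the F statistic equals the square of the t statistic, which is why the two tests give identical p-values in simple linear regression. The p-value itself is not computed here because it requires the F distribution.

```python
# Armand's Pizza Parlors data as listed in Table 14.7
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]              # student population
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]  # sales
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# SS column of the ANOVA table
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # Regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # Residual (error)
sst = sum((yi - y_bar) ** 2 for yi in y)                # Total

# df and MS columns
df_regression, df_residual = 1, n - 2
msr = ssr / df_regression
mse = sse / df_residual

# F statistic, and the t statistic for the slope for comparison
f_stat = msr / mse
t_stat = b1 / (mse ** 0.5 / sxx ** 0.5)

print(f"SSR = {ssr:.0f}, SSE = {sse:.0f}, SST = {sst:.0f}")
print(f"MSR = {msr:.2f}, MSE = {mse:.2f}")
print(f"F = {f_stat:.2f}, t squared = {t_stat ** 2:.2f}")
```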


Interpretation of Regression Statistics Output 


The output in cells A15:B20 summarizes the regression statistics. The number of
observations in the data set, 10, is shown in cell B20. The coefficient of determination, .9027,
appears in cell B17; the corresponding label, R Square, is shown in cell A17. The square root
of the coefficient of determination provides the sample correlation coefficient of 0.9501
shown in cell B16. Note that Excel uses the label Multiple R (cell A16) to identify this
value. In cell A19, the label Standard Error is used to identify the value of s, the estimate
of \(\sigma\). Cell B19 shows that the value of s is 13.8293. We caution the reader to keep in mind
that in the Excel output, the label Standard Error appears in two different places. In the
Regression Statistics section of the output the label Standard Error refers to s, the estimate
of \(\sigma\). In the Estimated Regression Equation section of the output, the label Standard Error
refers to \(s_{b_1}\), the estimated standard deviation of the sampling distribution of \(b_1\).

Adjusted R Square, shown in cell B18 in Figure 14.13, is not meaningful for simple linear
regression. Adjusted R Square will be discussed in more detail in Chapter 15, where we
discuss linear regression with more than one independent variable.
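
The three regression statistics discussed here follow directly from the sums of squares in the ANOVA section. As a small check, and assuming only the values already quoted from cells C24:C26 and B20, the following Python sketch reproduces R Square, Multiple R, and Standard Error. The variable names are ours.

```python
import math

# Values read from the ANOVA section of the Armand's output
ssr = 14200.0   # regression sum of squares
sse = 1530.0    # residual (error) sum of squares
sst = 15730.0   # total sum of squares
n = 10          # number of observations

r_square = ssr / sst                        # coefficient of determination
multiple_r = math.sqrt(r_square)            # magnitude of the sample correlation
standard_error = math.sqrt(sse / (n - 2))   # s, the estimate of sigma

print(f"R Square       = {r_square:.4f}")
print(f"Multiple R     = {multiple_r:.4f}")
print(f"Standard Error = {standard_error:.4f}")
```

The square root gives only the magnitude of the correlation coefficient; its sign matches the sign of the slope, which is positive for the Armand's data.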


EXERCISES 
Applications 


40. Apartment Selling Price. The commercial division of a real estate firm conducted a 
study to determine the extent of the relationship between annual gross rents ($1000s) 
and the selling price ($1000s) for apartment buildings. Data were collected on several 
properties sold, and Excel’s Regression tool was used to develop an estimated regres- 
sion equation. A portion of the regression output follows. 

a. How many apartment buildings were in the sample?
b. Write the estimated regression equation.
c. Use the t test to determine whether the selling price is related to annual gross rents.
Use \(\alpha = .05\).
d. Use the F test to determine whether the selling price is related to annual gross rents.
Use \(\alpha = .05\).
e. Predict the selling price of an apartment building with gross annual rents of
$50,000.

41. Computer Terminal Maintenance. Following is a portion of the regression output
for an application relating maintenance expense (dollars per month) to usage (hours
per week) for a particular brand of computer terminal.


a. Write the estimated regression equation.
b. Use a t test to determine whether monthly maintenance expense is related to usage
at the .05 level of significance.
c. Did the estimated regression equation provide a good fit? Explain.

42. Annual Sales and Salespeople. A regression model relating the number of
salespersons at a branch office to annual sales at the office (in thousands of dollars) provided
the following regression output.


a. Write the estimated regression equation.
b. Compute the F statistic and test the significance of the relationship at the .05 level
of significance.
c. Compute the t statistic and test the significance of the relationship at the .05 level of
significance.
d. Predict the annual sales at the Memphis branch office. This branch employs 12
salespersons.

43. Estimating Setup Time. Sherry is a production manager for a small manufacturing 
shop and is interested in developing a predictive model to estimate the time to pro- 
duce an order of a given size—that is, the total time to produce a certain quantity of 
the product. She has collected data on the total time to produce 30 different orders of 
various quantities in the file Setup.

DATAfile: Setup

a. Develop a scatter diagram with quantity as the independent variable.
b. What does the scatter diagram developed in part (a) indicate about the relationship
between the two variables?
c. Develop the estimated regression equation. Interpret the intercept and slope.
d. Test for a significant relationship. Use \(\alpha = .05\).
e. Did the estimated regression equation provide a good fit?

44. Auto Racing Helmet. Automobile racing, high-performance driving schools, and
driver education programs run by automobile clubs continue to grow in popularity.
All these activities require the participant to wear a helmet that is certified by the
Snell Memorial Foundation, a not-for-profit organization dedicated to research,
education, testing, and development of helmet safety standards. Snell “SA” (Sports
Application) rated professional helmets are designed for auto racing and provide
extreme impact resistance and high fire protection. One of the key factors in
selecting a helmet is weight, since lower weight helmets tend to place less stress on
the neck. The following data show the weight and price for 18 SA helmets
(SoloRacer website).

DATAfile: RaceHelmets


Helmet                            Weight (oz)    Price ($)
Pyrotect Pro Airflow                   64           248
Pyrotect Pro Airflow Graphics          64           278
RCi Full Face                          64           200
RaceQuip RidgeLine                     64           200
HJC AR-10                              58           300
HJC Si-12                              47           700
HJC HX-10                              49           900
Impact Racing Super Sport              59           340
Zamp FSA-1                             66           199
Zamp RZ-2                              58           299
Zamp RZ-2 Ferrari                      58           299
Zamp RZ-3 Sport                        52           479
Zamp RZ-3 Sport Painted                52           479
Bell M2                                63           369
Bell M4                                62           369
Bell M4 Pro                            54           559
G Force Pro Force 1                    63           250
G Force Pro Force 1 Grafx              63           280


a. Develop a scatter diagram with weight as the independent variable.
b. Does there appear to be any relationship between these two variables?
c. Develop the estimated regression equation that could be used to predict the price
given the weight.
d. Test for the significance of the relationship at the .05 level of significance.
e. Did the estimated regression equation provide a good fit? Explain.


14.8 Residual Analysis: Validating Model Assumptions

Residual analysis is the primary tool for determining whether the assumed regression
model is appropriate.


As we noted previously, the residual for observation i is the difference between the observed
value of the dependent variable (\(y_i\)) and the predicted value of the dependent variable (\(\hat{y}_i\)).


RESIDUAL FOR OBSERVATION i


\[ y_i - \hat{y}_i \qquad (14.28) \]


where
\(y_i\) is the observed value of the dependent variable
\(\hat{y}_i\) is the predicted value of the dependent variable


In other words, the ith residual is the error resulting from using the estimated regression
equation to predict the value of the dependent variable. The residuals for the Armand's
Pizza Parlors example are computed in Table 14.7. The observed values of the dependent
variable are in the second column and the predicted values of the dependent variable,
obtained using the estimated regression equation \(\hat{y} = 60 + 5x\), are in the third column. An
analysis of the corresponding residuals in the fourth column will help determine whether
the assumptions made about the regression model are appropriate.

TABLE 14.7 Residuals for Armand's Pizza Parlors


Student Population    Sales    Predicted Sales      Residuals
       x_i             y_i     ŷ_i = 60 + 5x_i      y_i - ŷ_i
        2               58           70               -12
        6              105           90                15
        8               88          100               -12
        8              118          100                18
       12              117          120                -3
       16              137          140                -3
       20              157          160                -3
       20              169          160                 9
       22              149          170               -21
       26              202          190                12
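
The arithmetic behind Table 14.7 is simple enough to script. The following Python sketch is an illustration with variable names of our own. It applies the estimated regression equation to each student population value and subtracts the prediction from the observed sales.

```python
# Armand's Pizza Parlors data as listed in Table 14.7
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]              # student population
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]  # sales

b0, b1 = 60, 5  # estimated regression equation: y-hat = 60 + 5x

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]

print(f"{'x':>4}{'y':>6}{'y-hat':>8}{'residual':>10}")
for xi, yi, yh, e in zip(x, y, predicted, residuals):
    print(f"{xi:>4}{yi:>6}{yh:>8}{e:>10}")

# With the least squares method, the residuals sum to zero
print("sum of residuals:", sum(residuals))
```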


Let us now review the regression assumptions for the Armand's Pizza Parlors example.
A simple linear regression model was assumed.


\[ y = \beta_0 + \beta_1 x + \epsilon \qquad (14.29) \]


This model indicates that we assumed quarterly sales (y) to be a linear function of the size
of the student population (x) plus an error term \(\epsilon\). In Section 14.4 we made the following
assumptions about the error term \(\epsilon\).


1. \(E(\epsilon) = 0\).
2. The variance of \(\epsilon\), denoted by \(\sigma^2\), is the same for all values of x.
3. The values of \(\epsilon\) are independent.
4. The error term \(\epsilon\) has a normal distribution.


These assumptions provide the theoretical basis for the t test and the F test used to
determine whether the relationship between x and y is significant, and for the confidence and
prediction interval estimates presented in Section 14.6. If the assumptions about the error
term \(\epsilon\) appear questionable, the hypothesis tests about the significance of the regression
relationship and the interval estimation results may not be valid.

The residuals provide the best information about \(\epsilon\); hence an analysis of the residuals is
an important step in determining whether the assumptions for \(\epsilon\) are appropriate. Much of
residual analysis is based on an examination of graphical plots. In this section, we discuss
the following residual plots.


1. A plot of the residuals against values of the independent variable x 

2. A plot of residuals against the predicted values of the dependent variable y 
3. A standardized residual plot 

4. A normal probability plot 


Residual Plot Against x 


A residual plot against the independent variable x is a graph in which the values of
the independent variable are represented by the horizontal axis and the corresponding
residual values are represented by the vertical axis. A point is plotted for each residual.
The first coordinate for each point is given by the value of \(x_i\) and the second coordinate
is given by the corresponding value of the residual \(y_i - \hat{y}_i\). For a residual plot against x
with the Armand's Pizza Parlors data from Table 14.7, the coordinates of the first point
are (2, -12), corresponding to \(x_1 = 2\) and \(y_1 - \hat{y}_1 = -12\); the coordinates of the second
point are (6, 15), corresponding to \(x_2 = 6\) and \(y_2 - \hat{y}_2 = 15\); and so on. Figure 14.14
shows the resulting residual plot.

FIGURE 14.14 Plot of the Residuals Against the Independent Variable x for Armand's Pizza Parlors

Before interpreting the results for this residual plot, let us consider some general
patterns that might be observed in any residual plot. Three examples appear in Figure 14.15.
If the assumption that the variance of \(\epsilon\) is the same for all values of x holds and the assumed
regression model is an adequate representation of the relationship between the variables,
the residual plot should give an overall impression of a horizontal band of points such as
the one in Panel A of Figure 14.15. However, if the variance of \(\epsilon\) is not the same for all
values of x (for example, if variability about the regression line is greater for larger values of
x), a pattern such as the one in Panel B of Figure 14.15 could be observed. In this case, the
assumption of a constant variance of \(\epsilon\) is violated. Another possible residual plot is shown
in Panel C. In this case, we would conclude that the assumed regression model is not an
adequate representation of the relationship between the variables. A curvilinear regression
model or multiple regression model should be considered.

Now let us return to the residual plot for Armand’s Pizza Parlors shown in Figure 14.14. 
The residuals appear to approximate the horizontal pattern in Panel A of Figure 14.15. 
Hence, we conclude that the residual plot does not provide evidence that the assumptions 
made for Armand’s regression model should be challenged. At this point, we are confident 
in the conclusion that Armand’s simple linear regression model is valid. 

Experience and good judgment are always factors in the effective interpretation of residual 
plots. Seldom does a residual plot conform precisely to one of the patterns in Figure 14.15. 
Yet analysts who frequently conduct regression studies and frequently review residual plots 
become adept at understanding the differences between patterns that are reasonable and 
patterns that indicate the assumptions of the model should be questioned. A residual plot 
provides one technique to assess the validity of the assumptions for a regression model. 


FIGURE 14.15 Residual Plots from Three Regression Studies


Residual Plot Against ŷ


Another residual plot represents the predicted value of the dependent variable \(\hat{y}\) on the
horizontal axis and the residual values on the vertical axis. A point is plotted for each
residual. The first coordinate for each point is given by \(\hat{y}_i\) and the second coordinate is
given by the corresponding value of the ith residual \(y_i - \hat{y}_i\). With the Armand's data from
Table 14.7, the coordinates of the first point are (70, -12), corresponding to \(\hat{y}_1 = 70\) and
\(y_1 - \hat{y}_1 = -12\); the coordinates of the second point are (90, 15); and so on. Figure 14.16
provides the residual plot. Note that the pattern of this residual plot is the same as the
pattern of the residual plot against the independent variable x. It is not a pattern that would
lead us to question the model assumptions. For simple linear regression, both the residual
plot against x and the residual plot against \(\hat{y}\) provide the same pattern. For multiple
regression analysis, the residual plot against \(\hat{y}\) is more widely used because of the presence of
more than one independent variable.

FIGURE 14.16 Plot of the Residuals Against the Predicted Values ŷ for Armand's Pizza Parlors

Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.


Standardized Residuals 


Many of the residual plots provided by computer software packages use a standardized 
version of the residuals. As demonstrated in preceding chapters, a random variable is stan- 
dardized by subtracting its mean and dividing the result by its standard deviation. With the 
least squares method, the mean of the residuals is zero. Thus, simply dividing each residual 
by its standard deviation provides the standardized residual. 

It can be shown that the standard deviation of residual i depends on the standard error of
the estimate s and the corresponding value of the independent variable \(x_i\).


STANDARD DEVIATION OF THE ith RESIDUAL²


\[ s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i} \qquad (14.30) \]

where

\(s_{y_i - \hat{y}_i}\) = the standard deviation of residual i
s = the standard error of the estimate

\[ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \qquad (14.31) \]


² This equation actually provides an estimate of the standard deviation of the ith residual, because s is used instead of \(\sigma\).

Note that equation (14.30) shows that the standard deviation of the ith residual depends on
\(x_i\) because of the presence of \(h_i\) in the formula.³ Once the standard deviation of each
residual is calculated, we can compute the standardized residual by dividing each residual by its
corresponding standard deviation.


STANDARDIZED RESIDUAL FOR OBSERVATION i


\[ \frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}} \qquad (14.32) \]


Table 14.8 shows the calculation of the standardized residuals for Armand's Pizza Parlors.
Recall that previous calculations showed s = 13.829. Figure 14.17 is the plot of the
standardized residuals against the independent variable x.

The standardized residual plot can provide insight about the assumption that the error
term \(\epsilon\) has a normal distribution. If this assumption is satisfied, the distribution of the
standardized residuals should appear to come from a standard normal probability distribution.⁴
Thus, when looking at a standardized residual plot, we should expect to see approximately
95% of the standardized residuals between -2 and +2. We see in Figure 14.17 that for the
Armand's example all standardized residuals are between -2 and +2. Therefore, on the
basis of the standardized residuals, this plot gives us no reason to question the assumption
that \(\epsilon\) has a normal distribution.

Small departures from normality do not have a great effect on the statistical tests used in
regression analysis.

Because of the effort required to compute the estimated values of \(\hat{y}\), the residuals,
and the standardized residuals, most statistical packages provide these values as optional
regression output. Hence, residual plots can be easily obtained. For large problems
computer packages are the only practical means for developing the residual plots discussed
in this section.


TABLE 14.8 Computation of Standardized Residuals for Armand's Pizza Parlors


                                             (x_i - x̄)²                                         Standardized
Restaurant i   x_i   x_i - x̄   (x_i - x̄)²   ------------    h_i    s_{y_i - ŷ_i}   y_i - ŷ_i     Residual
                                             Σ(x_i - x̄)²
     1           2     -12        144          .2535        .3535     11.1193         -12         -1.0792
     2           6      -8         64          .1127        .2127     12.2709          15          1.2224
     3           8      -6         36          .0634        .1634     12.6493         -12          -.9487
     4           8      -6         36          .0634        .1634     12.6493          18          1.4230
     5          12      -2          4          .0070        .1070     13.0682          -3          -.2296
     6          16       2          4          .0070        .1070     13.0682          -3          -.2296
     7          20       6         36          .0634        .1634     12.6493          -3          -.2372
     8          20       6         36          .0634        .1634     12.6493           9           .7115
     9          22       8         64          .1127        .2127     12.2709         -21         -1.7114
    10          26      12        144          .2535        .3535     11.1193          12          1.0792

                      Total       568


Note: The values of the residuals were computed in Table 14.7.
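
The columns of Table 14.8 can be reproduced by applying equations (14.30) through (14.32) row by row. The Python sketch below does so as an illustration; the variable names are ours. It takes the student population values and the residuals from Table 14.7 as inputs and computes s from the residuals rather than using the rounded value 13.829, so the last printed digit may occasionally differ from the table.

```python
import math

# Inputs from Table 14.7
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]             # student population
residuals = [-12, 15, -12, 18, -3, -3, -3, 9, -21, 12]
n = len(x)

# Standard error of the estimate: s = sqrt(SSE / (n - 2))
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

rows = []
for xi, e in zip(x, residuals):
    h = 1 / n + (xi - x_bar) ** 2 / sxx   # equation (14.31)
    sd = s * math.sqrt(1 - h)             # equation (14.30)
    rows.append((xi, h, sd, e / sd))      # equation (14.32)

print(f"{'x':>3} {'h':>7} {'sd':>9} {'std resid':>10}")
for xi, h, sd, z in rows:
    print(f"{xi:>3} {h:7.4f} {sd:9.4f} {z:10.4f}")
```

Observations far from the mean of x (here x = 2 and x = 26) have the largest leverage and therefore the smallest residual standard deviation.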


³ \(h_i\) is referred to as the leverage of observation i. Leverage will be discussed further when we consider influential
observations in Section 14.9.

⁴ Because s is used instead of \(\sigma\) in equation (14.30), the probability distribution of the standardized residuals is not
technically normal. However, in most regression studies, the sample size is large enough that a normal
approximation is very good.

FIGURE 14.17 Plot of the Standardized Residuals Against the Independent Variable x for Armand's Pizza Parlors


Using Excel to Construct a Residual Plot


In Section 14.7 we showed how Excel's Regression tool could be used for regression
analysis. The Regression tool also provides the capability to obtain a residual plot against
the independent variable x. When used with Excel's chart tools, the Regression tool
residual output can also be used to construct a residual plot against \(\hat{y}\) as well as an Excel
version of a standardized residual plot.


Residual Plot Against x To obtain a residual plot against x, the steps that we describe in 
Section 14.7 in order to obtain the regression output are performed with one change. When 
the Regression tool dialog box appears (see Figure 14.13), we must also select the Residual 
Plots option in the Residual section. The regression output will appear as described previ- 
ously, and the worksheet will also contain a chart showing a plot of the residuals against the 
independent variable Population. In addition, a list of predicted values of y and the corre- 
sponding residual values are provided below the regression output. Figure 14.18 shows the 
residual output for the Armand’s Pizza Parlors problem; note that rows 12-32, containing the 
standard Regression tool output, have been hidden to better focus on the residual portion of 
the output. We see that the shape of this plot is the same as shown previously in Figure 14.14. 


Residual Plot Against ŷ Using Excel's chart tools and the residual output provided in
Figure 14.18, we can easily construct a residual plot against \(\hat{y}\). The following steps describe
how to use Excel's chart tools to construct the residual plot using the regression tool output
in the worksheet.


Step 1. Select cells B37:C46
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click Insert Scatter (X, Y) or Bubble Chart
Step 4. When the list of scatter diagram subtypes appears:
        Click Scatter with only Markers (the chart in the upper left corner)

FIGURE 14.18 Regression Tool Residual Output for the Armand's Pizza Parlors Problem
Note: Rows 12-32 are hidden.


The resulting chart will look similar to the residual plot shown in Figure 14.16. Adding 
a chart title, labels for the horizontal and vertical axes, as well as other formatting options 
can be easily done. Note that except for using different data to construct the chart as well 
as different labels for the chart output, the steps describing how to use Excel’s chart tools 
to construct a residual plot against ŷ are the same as the steps we used to construct a scatter 
diagram in Section 14.2. 


Excel's Standardized Residual Plot Excel can be used to construct what it calls a
standardized residual plot. Excel's standardized residual plot is really an approximation of the
true standardized residual plot. In Excel, the standard deviation of the ith residual is not
computed using equation (14.30). Instead, Excel estimates \(s_{y_i - \hat{y}_i}\) using the standard
deviation of the n residual values. Then, dividing each residual by this estimate, Excel obtains
what it refers to as a standard residual. The plot of these standard residuals is what you get
when you request a plot of the standardized residuals using Excel's Regression tool.

We will illustrate how to construct a standardized residual plot using Excel for the
Armand's Pizza Parlors problem. The residuals for the Armand's Pizza Parlors problem
are -12, 15, -12, 18, -3, -3, -3, 9, -21, and 12. Using Excel's STDEV.S function, we
computed a standard deviation of 13.0384 for these 10 data values. To compute the standard
residuals, Excel divides each residual by 13.0384; the results are shown in Table 14.9. Both
the standardized residuals, computed in Table 14.8, and Excel's standard residuals are shown.
There is not a great deal of difference between Excel's standard residuals and the true
standardized residuals. In general, the differences get smaller as the sample size increases. Often
we are interested only in identifying the general pattern of the points in a standardized residual
plot; in such cases, the small differences between the standardized residuals and Excel's
standard residuals will have little effect on the pattern observed. Thus these differences will not
influence the conclusions reached when we use the residual plot to validate model assumptions.
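
Excel's shortcut is easy to imitate. In the Python sketch below, which is an illustration and not Excel's actual code, the standard library function `statistics.stdev` plays the role of STDEV.S because both compute the sample standard deviation with n - 1 in the denominator. Each residual is then divided by that single value.

```python
import statistics

# Residuals for Armand's Pizza Parlors (Table 14.7)
residuals = [-12, 15, -12, 18, -3, -3, -3, 9, -21, 12]

# One pooled estimate for every observation, as in Excel's STDEV.S
sd_excel = statistics.stdev(residuals)

# Excel-style "standard residuals"
standard_residuals = [e / sd_excel for e in residuals]

print(f"standard deviation of the residuals: {sd_excel:.4f}")
for i, (e, z) in enumerate(zip(residuals, standard_residuals), start=1):
    print(f"restaurant {i:>2}: residual {e:>4}, standard residual {z:8.4f}")
```

Because the same divisor is used for every observation, this version ignores the leverage term in equation (14.30), which is the source of the small differences from the true standardized residuals.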

TABLE 14.9 Computation of Excel's Standard Residuals


                             Values from Table 14.8            Values Using Excel
                                           Standardized    Estimate of      Standard
Restaurant i   y_i - ŷ_i   s_{y_i - ŷ_i}     Residual      s_{y_i - ŷ_i}    Residual
     1           -12         11.1193         -1.0792         13.0384        -0.9204
     2            15         12.2709          1.2224         13.0384         1.1504
     3           -12         12.6493          -.9487         13.0384        -0.9204
     4            18         12.6493          1.4230         13.0384         1.3805
     5            -3         13.0682          -.2296         13.0384        -0.2301
     6            -3         13.0682          -.2296         13.0384        -0.2301
     7            -3         12.6493          -.2372         13.0384        -0.2301
     8             9         12.6493           .7115         13.0384         0.6903
     9           -21         12.2709         -1.7114         13.0384        -1.6106
    10            12         11.1193          1.0792         13.0384         0.9204


The Regression tool and the chart tools can be used to obtain Excel's standardized
residual plot. First, the steps that we described in Section 14.7 in order to conduct a
regression analysis are performed with one change. When the Regression dialog box
appears (see Figure 14.13), we must select the Standardized Residuals option. In addition
to the regression output described previously, the output will contain a list of predicted
values of y, residuals, and standard residuals, as shown in cells A34:D46 in Figure 14.19.


FIGURE 14.19 Standardized Residual Plot Against the Independent Variable
Population for the Armand's Pizza Parlors Example
Note: Rows 1-33 are hidden.

The Standardized Residuals option does not automatically produce a standardized
residual plot. But we can use Excel's chart tools to construct a scatter diagram in which
the values of the independent variable are placed on the horizontal axis and the values of
the standard residuals are placed on the vertical axis. The procedure that describes how to
use Excel's chart tools to construct a standardized residual plot is similar to the steps we
showed for using Excel's chart tools to construct a residual plot against \(\hat{y}\). Because it is
somewhat easier to plot two variables in Excel when the variables are in adjacent columns
of the workbook, we copied the data and heading for the independent variable Population
into cells F36:F46 and the data and heading for the standard residuals into cells G36:G46.

Using the data in cells F36:G46 and Excel's chart tools we obtained the scatter diagram
shown in Figure 14.19; this scatter diagram is Excel's version of the standardized residual
plot for the Armand's Pizza Parlors example. Comparing Excel's version of the
standardized residual plot to the standardized residual plot in Figure 14.17, we see the same pattern
evident. All of the standardized residuals in both figures are between -2 and +2,
indicating no reason to question the assumption that \(\epsilon\) has a normal distribution.


Normal Probability Plot 


Another approach for determining the validity of the assumption that the error term has a 
normal distribution is the normal probability plot. To show how a normal probability plot 
is developed, we introduce the concept of normal scores. 

Suppose 10 values are selected randomly from a normal probability distribution with 
a mean of zero and a standard deviation of one, and that the sampling process is repeated 
over and over with the values in each sample of 10 ordered from smallest to largest. For 
now, let us consider only the smallest value in each sample. The random variable represent- 
ing the smallest value obtained in repeated sampling is called the first-order statistic. 

Statisticians show that for samples of size 10 from a standard normal probability
distribution, the expected value of the first-order statistic is -1.55. This expected value is called
a normal score. For the case with a sample of size n = 10, there are 10 order statistics and
10 normal scores (see Table 14.10). In general, a data set consisting of n observations will
have n order statistics and hence n normal scores.

Let us now show how the 10 normal scores can be used to determine whether the stand- 
ardized residuals for Armand’s Pizza Parlors appear to come from a standard normal proba- 
bility distribution. We begin by ordering the 10 standardized residuals from Table 14.8. The 
10 normal scores and the ordered standardized residuals are shown together in Table 14.11. 

TABLE 14.10 Normal Scores for n = 10

Order Statistic    Normal Score
 1                 −1.55
 2                 −1.00
 3                  −.65
 4                  −.37
 5                  −.12
 6                   .12
 7                   .37
 8                   .65
 9                  1.00
10                  1.55

TABLE 14.11 Normal Scores and Ordered Standardized Residuals for Armand’s Pizza Parlors

Normal Scores    Ordered Standardized Residuals
−1.55            −1.7114
−1.00            −1.0792
 −.65             −.9487
 −.37             −.2372
 −.12             −.2296
  .12             −.2296
  .37              .7115
  .65             1.0792
 1.00             1.2224
 1.55             1.4230

If the normality assumption is satisfied, the smallest standardized residual should be close to
the smallest normal score, the next smallest standardized residual should be close to the next
smallest normal score, and so on. If we were to develop a plot with the normal scores on the
horizontal axis and the corresponding standardized residuals on the vertical axis, the plotted
points should cluster closely around a 45-degree line passing through the origin if the
standardized residuals are approximately normally distributed. Such a plot is referred to as a normal
probability plot.
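A numerical companion to the plot can be sketched as follows. The text relies on visual judgment only; the two summaries computed here (the largest vertical gap from the 45-degree line and the correlation between the two columns of Table 14.11) are assumptions added for illustration, not a test described in this chapter.

```python
from math import sqrt

# Values from Table 14.11 (normal scores and ordered standardized residuals).
normal_scores = [-1.55, -1.00, -0.65, -0.37, -0.12, 0.12, 0.37, 0.65, 1.00, 1.55]
ordered_residuals = [-1.7114, -1.0792, -0.9487, -0.2372, -0.2296,
                     -0.2296, 0.7115, 1.0792, 1.2224, 1.4230]

def correlation(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Vertical distance of each plotted point from the 45-degree line through the origin.
gaps = [residual - score for score, residual in zip(normal_scores, ordered_residuals)]
largest_gap = max(abs(gap) for gap in gaps)
r = correlation(normal_scores, ordered_residuals)

print(f"largest gap from the 45-degree line: {largest_gap:.4f}")
print(f"correlation between scores and residuals: {r:.4f}")
```

Small gaps and a correlation near 1 are consistent with points that cluster around the 45-degree line.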

Figure 14.20 is the normal probability plot for the Armand’s Pizza Parlors example.
Judgment is used to determine whether the pattern observed deviates from the line enough
to conclude that the standardized residuals are not from a standard normal probability
distribution. In Figure 14.20, we see that the points are grouped closely about the line.
We therefore conclude that the assumption of the error term having a normal probability
distribution is reasonable. In general, the more closely the points are clustered about
the 45-degree line, the stronger the evidence supporting the normality assumption. Any
substantial curvature in the normal probability plot is evidence that the residuals have not
come from a normal distribution.

FIGURE 14.20 Normal Probability Plot for Armand’s Pizza Parlors


NOTES + COMMENTS 


1. We use residual and normal probability plots to validate the assumptions of a regression
model. If our review indicates that one or more assumptions are questionable, a
different regression model or a transformation of the data should be considered. The
appropriate corrective action when the assumptions are violated must be based on
good judgment; recommendations from an experienced statistician can be valuable.

2. Analysis of residuals is the primary method statisticians use to verify that the
assumptions associated with a regression model are valid. Even if no violations are
found, it does not necessarily follow that the model will yield good predictions.
However, if additional statistical tests support the conclusion of significance and the
coefficient of determination is large, we should be able to develop good estimates
and predictions using the estimated regression equation.


EXERCISES 


Methods 

45. Given are data for two variables, x and y.

xᵢ | 6  11  15  18  20
yᵢ | 6   8  12  20  30


a. Develop an estimated regression equation for these data. 

b. Compute the residuals. 

c. Develop a plot of the residuals against the independent variable x. Do the assump- 
tions about the error terms seem to be satisfied? 

d. Compute the standardized residuals. 

e. Develop a plot of the standardized residuals against ŷ. What conclusions can you
draw from this plot?


46. The following data were used in a regression study.

Observation    xᵢ    yᵢ
1              2     4
2              3     5
3              4     4
4              5     6
5              7     4
6              7     6
7              7     9
8              8     5
9              9    11

a. Develop an estimated regression equation for these data.
b. Construct a plot of the residuals. Do the assumptions about the error term seem to
be satisfied?

Applications 
47. Restaurant Advertising and Revenue. Data on advertising expenditures and revenue 
(in thousands of dollars) for the Four Seasons Restaurant follow.

Advertising Expenditures    Revenue
 1                          19
 2                          32
 4                          44
 6                          40
10                          52
14                          53
20                          54


a. Let x equal advertising expenditures and y equal revenue. Use the method of least 
squares to develop a straight line approximation of the relationship between the two 
variables. 

b. Test whether revenue and advertising expenditures are related at a .05 level of 
significance. 

c. Prepare a residual plot of y − ŷ versus ŷ. Use the result from part (a) to obtain the
values of ŷ.

d. What conclusions can you draw from residual analysis? Should this model be used, 
or should we look for a better one? 

48. Experience and Sales. Refer to exercise 7, where an estimated regression equation 
relating years of experience and annual sales was developed. 

a. Compute the residuals and construct a residual plot for this problem. 

b. Do the assumptions about the error terms seem reasonable in light of the residual
plot?

49. Cost of Living Index. The file CostLiving contains the cost of living indexes and the
population densities (number of people per square mile) for 61 cities in the United
States. The cost of living index measures the cost of living in a particular city relative to
the cost of living in New York City. San Francisco has an index of 112.15, meaning that
it is 12.15% more expensive to live in San Francisco than New York City. Washington,
DC, has an index of 84.64, which means that the cost to live in Washington, DC, is only
84.64% of what it costs to live in New York City (Numbeo website).

a. Develop the estimated regression equation that can be used to predict the cost of 
living index for a U.S. city, given the city’s population density. 

b. Construct a residual plot against the independent variable. 

c. Review the residual plot constructed in part (b). Do the assumptions of the error 
term and model form seem reasonable? 


14.9 Outliers and Influential Observations 


In this section we discuss how to identify observations that can be classified as outliers 
or as being especially influential in determining the estimated regression equation. Some 
steps that should be taken when such observations are identified are provided. 


Detecting Outliers 


An outlier is a data point (observation) that does not fit the trend shown by the remaining 
data. Outliers represent observations that are suspect and warrant careful examination. 
They may represent erroneous data; if so, they should be corrected. They may signal a 
violation of model assumptions; if so, another model should be considered. Finally, they 
may simply be unusual values that have occurred by chance. In this case, they should be 
retained. 

To illustrate the process of detecting outliers, consider the data set in Table 14.12;
Figure 14.21 shows the scatter diagram for these data and a portion of the Regression tool
output, including the tabular residual output obtained using the Standardized Residuals
option. The estimated regression equation is ŷ = 64.95 − 7.330x and R Square is .4968;
thus, only 49.68% of the variability in the values of y is explained by the estimated
regression equation. However, except for observation 4 (x₄ = 3, y₄ = 75), a pattern suggesting a
strong negative linear relationship is apparent. Indeed, given the pattern of the rest of the
data, we would have expected y₄ to be much smaller and hence would consider observation
4 to be an outlier. For the case of simple linear regression, one can often detect outliers by
simply examining the scatter diagram.

TABLE 14.12 Data Set Illustrating the Effect of an Outlier

xᵢ    yᵢ
1     45
1     55
2     50
3     75
3     40
3     45
4     30
4     35
5     25
6     15

FIGURE 14.21 [Scatter diagram and Regression tool output for the data in Table 14.12.
Horizontal axis: Independent Variable x. Vertical axis: Dependent Variable y. Fitted line:
y = −7.3305x + 64.958. Annotation: The standardized residual for observation 4 is greater
than +2. Hence we consider observation 4 to be an outlier. Note: Rows 22-33 are hidden.]

FIGURE 14.22 Regression Tool Output for the Revised Outlier Data Set
[Horizontal axis: Independent Variable x. Vertical axis: Dependent Variable y. Fitted line:
y = −6.9492x + 59.237.]

The standardized residuals can also be used to identify outliers. If an observation
deviates greatly from the pattern of the rest of the data, the corresponding standardized residual
will be large in absolute value. We recommend considering any observation with a
standardized residual of less than −2 or greater than +2 as an outlier. With normally
distributed errors, standardized residuals should be outside these limits approximately 5% of the
time. In the residual output section of Figure 14.21 we see that the standard residual value
for observation 4 is 2.68; this value suggests we treat observation 4 as an outlier.
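The ±2 screening rule can be sketched in a few lines. The data are the ten observations of the outlier example, with observation 4 at (3, 75). Two versions of the standardized residual are computed: one divides each residual by s√(1 − hᵢ), as in equations (14.30) and (14.32); the other divides by the sample standard deviation of the residuals, which is an assumption about how the Regression tool computes its Standard Residuals column (it does reproduce the reported 2.68). Variable names are illustrative.

```python
from math import sqrt

# Data for the outlier example; observation 4 is (3, 75).
x = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of the slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = sqrt(sse / (n - 2))  # standard error of the estimate

# Standardized residuals based on the leverage of each observation.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
standardized = [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Assumed spreadsheet-style version: residual divided by the sample
# standard deviation of the residuals (the residuals have mean zero).
excel_style = [e / sqrt(sse / (n - 1)) for e in residuals]

outliers = [i for i, z in enumerate(standardized, start=1) if abs(z) > 2]
print(f"estimated regression equation: y-hat = {b0:.3f} + ({b1:.4f})x")
print("observations flagged as outliers:", outliers)
print(f"observation 4: {standardized[3]:.2f} (leverage-based), {excel_style[3]:.2f} (assumed Excel-style)")
```

The two versions differ slightly, but both exceed +2 for observation 4, so the conclusion is the same under either definition.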

In deciding how to handle an outlier, we should first check to see whether it is a
valid observation. Perhaps an error has been made in initially recording the data or
in entering the data into the worksheet. For example, suppose that in checking the
data in Table 14.12, we find that an error has been made and that the correct value
for observation 4 is x₄ = 3, y₄ = 30. Figure 14.22 shows a portion of the Regression
tool output after correction of the value of y₄. The estimated regression equation is
ŷ = 59.23 − 6.949x and R Square is .8380. Note also that no standard residuals are
less than −2 or greater than +2; hence, the revised data contain no outliers. We see
that using the incorrect data value had a substantial effect on the goodness of fit. With
the correct data, the value of R Square has increased from .4968 to .8380 and the
value of b₀ has decreased from 64.95 to 59.23. The slope of the line has changed from
−7.330 to −6.949. The identification of the outlier enables us to correct the data error
and improve the regression results.
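The effect of the correction can be checked by fitting the line twice, once with y₄ = 75 and once with y₄ = 30. The sketch below is illustrative only; the helper name `fit` is hypothetical, and the data are the ten observations of the outlier example.

```python
def fit(x, y):
    """Return (b0, b1, r_squared) for a least squares line through (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = b1 * sxy  # sum of squares due to regression
    return b0, b1, ssr / sst

x = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y_recorded = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]   # observation 4 recorded as 75
y_corrected = [45, 55, 50, 30, 40, 45, 30, 35, 25, 15]  # observation 4 corrected to 30

original = fit(x, y_recorded)
corrected = fit(x, y_corrected)

for label, (b0, b1, r2) in (("recorded", original), ("corrected", corrected)):
    print(f"{label:9s}: b0 = {b0:.3f}, b1 = {b1:.4f}, R Square = {r2:.4f}")
```

Changing a single y value moves both coefficients and raises R Square substantially, which is why a suspected outlier should be checked against the source records before any other action is taken.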


Detecting Influential Observations 


An influential observation is an observation that has a strong influence on the regression
results. An influential observation may be an outlier (an observation with a y value that
deviates substantially from the trend of the remaining data), it may correspond to an x value
far from its mean (extreme x value), or it may be caused by a combination of a somewhat
off-trend y value and a somewhat extreme x value. Because influential observations may
have such a dramatic effect on the estimated regression equation, they must be examined
carefully. First, we should check to make sure no error has been made in collecting or
recording the data. If such an error has occurred, it can be corrected and a new estimated
regression equation developed. If the observation is valid, we might consider ourselves
fortunate to have it. Such a point, if valid, can contribute to a better understanding of the
appropriate model and can lead to a better estimated regression equation.

TABLE 14.13 Data Set Illustrating the Effect of an Influential Observation

xᵢ     yᵢ
10     125
10     130
15     120
20     115
20     120
25     110
70     100

To illustrate the process of detecting influential observations, consider the data set in
Table 14.13. The top part of Figure 14.23 shows the scatter diagram for these data and
the graph of the corresponding estimated regression equation ŷ = 127.4 − .425x. The
bottom part of Figure 14.23 shows the scatter diagram for the data in Table 14.13 with
observation 7 (x₇ = 70, y₇ = 100) deleted; for these data the estimated regression
equation is ŷ = 138.1 − 1.090x. With observation 7 deleted, the value of b₀ has increased
from 127.4 to 138.1. The slope of the line has changed from −.425 to −1.090. The
effect of observation 7 on the regression results is dramatic and is confirmed by looking at
the graphs of the two estimated regression equations. Clearly observation 7 is influential.

FIGURE 14.23 [Two scatter diagrams with fitted lines; horizontal axis: Independent Variable x;
vertical axis: Dependent Variable y. Top panel: full data set, y = −0.4251x + 127.47.
Bottom panel: Data Set With Observation 7 Deleted, y = −1.0909x + 138.18.]
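The comparison described above, fitting the line with and without observation 7, can be reproduced directly from Table 14.13. The helper name `least_squares` is hypothetical; the sketch uses only the least squares formulas for the slope and intercept.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line through the points (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Table 14.13; observation 7 is the last point, (70, 100).
x = [10, 10, 15, 20, 20, 25, 70]
y = [125, 130, 120, 115, 120, 110, 100]

b0_all, b1_all = least_squares(x, y)
b0_drop, b1_drop = least_squares(x[:-1], y[:-1])  # observation 7 deleted

print(f"all observations:      y-hat = {b0_all:.2f} + ({b1_all:.4f})x")
print(f"observation 7 deleted: y-hat = {b0_drop:.2f} + ({b1_drop:.4f})x")
```

A large change in the coefficients when a single point is removed is the practical signal that the point is influential.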

Observations with extreme values for the independent variables are called high leverage 
points. Observation 7 in the data set shown in Table 14.13 (which we have identified as 
being influential) is a point with high leverage. The leverage of an observation is
determined by how far the value of the independent variable is from its mean value. For the
single-independent-variable case, the leverage of the ith observation, denoted hᵢ, can be
computed using equation (14.33).


LEVERAGE OF OBSERVATION i

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \tag{14.33}$$


From the formula, it is clear that the farther xᵢ is from its mean x̄, the higher the leverage of
observation i. For the data in Table 14.13, the leverage of observation 7 is computed using
equation (14.33) as follows:

$$h_7 = \frac{1}{n} + \frac{(x_7 - \bar{x})^2}{\sum (x_i - \bar{x})^2} = \frac{1}{7} + \frac{(70 - 24.286)^2}{2621.43} = .94$$


For simple linear regression, we consider observations as having high leverage if hᵢ > 6/n;
for the data in Table 14.13, 6/n = 6/7 = .86. Thus, because h₇ = .94 > .86, observation 7
would be identified as having high leverage. Data points having high leverage are often 
influential. Influential observations that are caused by an interaction of somewhat large 
residuals and somewhat high leverage can be difficult to detect. Diagnostic procedures are 
available that take both into account in determining when an observation is influential. 
More advanced books on regression analysis discuss the use of such procedures. Excel 
does not have built-in capabilities for identifying outliers and high leverage points. Thus, 
we recommend reviewing the scatter diagram after fitting the regression line. For any point 
significantly off the line, rerun the regression analysis after deleting the observation. If the 
results change dramatically, the point in question is an influential observation. 
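Because the leverage formula depends only on the x values, it is easy to compute for every observation. The sketch below applies equation (14.33) and the 6/n rule of thumb to the x values of Table 14.13; variable names are illustrative.

```python
# x values from Table 14.13.
x = [10, 10, 15, 20, 20, 25, 70]

n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Equation (14.33): h_i = 1/n + (x_i - x_bar)^2 / sum((x_i - x_bar)^2)
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

cutoff = 6 / n  # rule of thumb for simple linear regression
high_leverage = [i for i, h in enumerate(leverage, start=1) if h > cutoff]

for i, h in enumerate(leverage, start=1):
    print(f"observation {i}: leverage = {h:.2f}")
print(f"cutoff 6/n = {cutoff:.2f}; high leverage observations: {high_leverage}")
```

Only observation 7 exceeds the cutoff, which agrees with the hand computation of h₇ = .94.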


NOTE + COMMENT 


Once an observation is identified as potentially influential because of a large residual or
high leverage, its impact on the estimated regression equation should be evaluated. More
advanced texts discuss diagnostics for doing so. However, if one is not familiar with the
more advanced material, a simple procedure is to run the regression analysis with and without
the observation. This approach will reveal the influence of the observation on the results.


EXERCISES 


Methods 

50. Consider the following data for two variables, x and y.

xᵢ | 135  110  130  145  175  160  120
yᵢ | 145  100  120  120  130  130  110

a. Develop a scatter diagram for these data. Does the scatter diagram indicate any
outliers in the data? In general, what implications does this finding have for simple
linear regression?

b. Compute the standardized residuals for these data. Do the data include any outliers? 
Explain. 

51. Consider the following data for two variables, x and y. 


xᵢ |  4   5   7   8  10  12  12  22
yᵢ | 12  14  16  15  18  20  24  19


a. Develop a scatter diagram for these data. Does the scatter diagram indicate any 
influential observations? Explain. 

b. Compute the standardized residuals for these data. Do the data include any outliers? 
Explain. 

c. Compute the leverage values for these data. Does there appear to be any influential 
observations in these data? Explain. 


Applications 

52. Predicting Charity Expenses. Charity Navigator is America’s leading independent 
charity evaluator. The following data show the total expenses ($), the percentage of 
the total budget spent on administrative expenses, the percentage spent on fundraising, 
and the percentage spent on program expenses for 10 supersized charities (Charity 
Navigator website). Administrative expenses include overhead, administrative staff 
and associated costs, and organizational meetings. Fundraising expenses are what a 
charity spends to raise money, and program expenses are what the charity spends on 
the programs and services it exists to deliver. The sum of the three percentages does 
not add to 100% because of rounding. 


DATA file: Charities

                                                               Administrative   Fundraising   Program
                                             Total Expenses    Expenses         Expenses      Expenses
Charity                                      ($)               (%)              (%)           (%)
American Red Cross                           3,354,177,445      3.9              3.8          92.1
World Vision                                 1,205,887,020      4.0              7.5          88.3
Smithsonian Institution                      1,080,995,083     23.5              2.6          73.7
Food For The Poor                            1,050,829,851       .7              2.4          96.8
American Cancer Society                      1,003,781,897      6.1             22.2          71.6
Volunteers of America                          929,158,968      8.6              1.9          89.4
Dana-Farber Cancer Institute                   877,321,613     13.1              1.6          85.2
AmeriCares                                     854,604,824       .4               .7          98.8
ALSAC—St. Jude Children's Research Hospital    829,662,076      9.6             16.9          73.4
City of Hope                                   736,176,619     13.7              3.0          83.1


a. Develop a scatter diagram with fundraising expenses (%) on the horizontal axis and 
program expenses (%) on the vertical axis. Looking at the data, do there appear to 
be any outliers and/or influential observations? 

b. Develop an estimated regression equation that could be used to predict program 
expenses (%) given fundraising expenses (%). 

c. Does the value for the slope of the estimated regression equation make sense in the 
context of this problem situation? 

d. Use residual analysis to determine whether any outliers and/or influential
observations are present. Briefly summarize your findings and conclusions.

53. Supermarket Checkout Lines. Retail chain Kroger has more than 2700 locations
and is the largest supermarket in the United States based on revenue. Kroger has 
invested heavily in data, technology, and analytics. Feeding predictive models with 
data from an infrared sensor system called QueVision to anticipate when shoppers will 
reach the checkout counters, Kroger is able to alert workers to open more checkout 
lines as needed. This has allowed Kroger to lower its average checkout time from four 
minutes to less than 30 seconds (Retail Touchpoints). 

Consider the data in the file Checkout. The file contains 32 observations. Each
observation gives the arrival time (measured in minutes before 6 P.M.) and the
shopping time (measured in minutes).

a. Develop a scatter diagram for arrival time as the independent variable. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? Do there appear to be any outliers or influential observa- 
tions? Explain. 

c. Using the entire data set, develop the estimated regression equation that can be used 
to predict the shopping time given the arrival time. 

d. Use residual analysis to determine whether any outliers or influential observations 
are present. 

e. After looking at the scatter diagram in part (a), suppose you were able to visually identify 
what appears to be an influential observation. Drop this observation from the data set and 
fit an estimated regression equation to the remaining data. Compare the estimated slope 
for the new estimated regression equation to the estimated slope obtained in part (c). 
Does this approach confirm the conclusion you reached in part (d)? Explain. 

54. The Value of a Major League Baseball Team. The following data show the annual
revenue ($ millions) and the estimated team value ($ millions) for the 30 Major
League Baseball teams (Forbes website).

DATA file: MLBValues

Team Revenue ($ millions) Value ($ millions)
Arizona Diamondbacks 195 584
Atlanta Braves 225 629
Baltimore Orioles 206 618
Boston Red Sox 336 1312
Chicago Cubs 274 1000
Chicago White Sox 216 692
Cincinnati Reds 202 546 
Cleveland Indians 186 559 
Colorado Rockies 109 587 
Detroit Tigers 238 643 
Houston Astros 196 626 
Kansas City Royals 169 457 
Los Angeles Angels of Anaheim 239 718 
Los Angeles Dodgers 245 1615 
Miami Marlins 195 520 
Milwaukee Brewers 201 562 
Minnesota Twins 214 578 
New York Mets 292 811 
New York Yankees 471 2300 
Oakland Athletics 173 468 
Philadelphia Phillies 279 893
Pittsburgh Pirates 178 479 
San Diego Padres 189 600 
San Francisco Giants 262 786 
Seattle Mariners 215 644 

St. Louis Cardinals 239 716
Tampa Bay Rays 167 451 
Texas Rangers 239 764 
Toronto Blue Jays 203 568 
Washington Nationals 225 631 


a. Develop a scatter diagram with Revenue on the horizontal axis and Value on the 
vertical axis. Looking at the scatter diagram, does it appear that there are any outli- 
ers and/or influential observations in the data? 

b. Develop the estimated regression equation that can be used to predict team value 
given the value of annual revenue. 

c. Use residual analysis to determine whether any outliers and/or influential observa- 
tions are present. Briefly summarize your findings and conclusions. 


14.10 Practical Advice: Big Data and Hypothesis Testing in 
Simple Linear Regression 


In Chapter 7, we observed that the standard errors of the sampling distributions of the
sample mean x̄ (shown in formula 7.2) and the sample proportion p̄ (shown in formula 7.5)
decrease as the sample size increases. In Chapters 8 and 9, we observed that this results in
narrower confidence interval estimates for μ and p and smaller p-values for the tests of the
hypotheses H₀: μ ≤ μ₀ and H₀: p ≤ p₀ as the sample size increases. These results extend to
simple linear regression. In simple linear regression, as the sample size increases,


• the p-value for the t test used to determine whether a significant relationship exists
between the dependent variable and the independent variable decreases;

• the confidence interval for the slope parameter associated with the independent
variable narrows;

• the confidence interval for the mean value of y narrows;

• the prediction interval for an individual value of y narrows.


Thus, as the sample size increases, we are more likely to reject the hypothesis that a rela- 
tionship does not exist between the dependent variable and the independent variable and 
conclude that a relationship exists. The interval estimates for the slope parameter associ- 
ated with the independent variable, the mean value of y, and predicted individual value of y 
will become more precise as the sample size increases. But this does not necessarily mean 
that these results become more reliable as the sample size increases. 
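The effect of sample size on the t test can be illustrated with simulated data. The sketch below is an illustration under stated assumptions: the true slope (.05), the intercept, the error standard deviation, the range of x, and the seed are all invented for the example. It reports the estimated slope, its estimated standard deviation, and the t statistic for a small and a large sample; a larger t statistic in absolute value corresponds to a smaller p-value.

```python
import random
from math import sqrt

def slope_test(n, slope=0.05, seed=7):
    """Simulate n observations with a weak linear relationship; return (b1, s_b1, t)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2.0 + slope * xi + rng.gauss(0, 1) for xi in x]  # assumed model for illustration

    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s / sqrt(sxx)      # estimated standard deviation of b1
    return b1, s_b1, b1 / s_b1

b1_small, se_small, t_small = slope_test(100)
b1_large, se_large, t_large = slope_test(10_000)

print(f"n = 100:    b1 = {b1_small:.4f}, s_b1 = {se_small:.4f}, t = {t_small:.2f}")
print(f"n = 10,000: b1 = {b1_large:.4f}, s_b1 = {se_large:.4f}, t = {t_large:.2f}")
```

The slope is just as small in practical terms in both samples; only the precision of its estimate changes. That is why statistical significance in a very large sample says little about practical significance.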

No matter how large the sample used to estimate the simple linear regression equation, 
we must be concerned about the potential presence of nonsampling error in the data. It 
is important to carefully consider whether a random sample of the population of interest 
has actually been taken. If the data to be used for testing the hypothesis of no relationship 
between the independent and dependent variable are corrupted by nonsampling error, 
the likelihood of making a Type I or Type II error may be higher than if the sample data 
are free of nonsampling error. If the relationship between the independent and dependent 
variable is statistically significant, it is also important to consider whether the relationship 
in the simple linear regression equation is of practical significance. 

Although simple linear regression is an extremely powerful statistical tool, it provides 
evidence that should be considered only in combination with information collected from 
other sources to make the most informed decision possible. No business decision should be 
based exclusively on inference in simple linear regression. Nonsampling error may lead to 
misleading results, and practical significance should always be considered in conjunction 
with statistical significance. This is particularly important when a hypothesis test is based 
on an extremely large sample because p-values in such cases can be extremely small. When
executed properly, inference based on simple linear regression can be an important
component in the business decision-making process.


SUMMARY 




In this chapter we showed how regression analysis can be used to determine how a
dependent variable y is related to an independent variable x. In simple linear regression, the
regression model is y = β₀ + β₁x + ε. The simple linear regression equation E(y) = β₀ + β₁x
describes how the mean or expected value of y is related to x. We used sample data and the
least squares method to develop the estimated regression equation ŷ = b₀ + b₁x. In effect, b₀
and b₁ are the sample statistics used to estimate the unknown model parameters β₀ and β₁.

The coefficient of determination was presented as a measure of the goodness of fit for the 
estimated regression equation; it can be interpreted as the proportion of the variation in the de- 
pendent variable y that can be explained by the estimated regression equation. We reviewed cor- 
relation as a descriptive measure of the strength of a linear relationship between two variables. 

The assumptions about the regression model and its associated error term ε were
discussed, and t and F tests, based on those assumptions, were presented as a means for
determining whether the relationship between two variables is statistically significant. We 
showed how to use the estimated regression equation to develop confidence interval esti- 
mates of the mean value of y and prediction interval estimates of individual values of y. 

The computer solution of regression problems and the use of residual analysis to 
validate the model assumptions and to identify outliers and influential observations were 
discussed. In the final section, we discussed the impact of big data on interpreting hypothe- 
sis tests in simple linear regression. 


GLOSSARY


ANOVA table The analysis of variance table used to summarize the computations associ- 
ated with the F test for significance. 

Coefficient of determination A measure of the goodness of fit of the estimated regression 
equation. It can be interpreted as the proportion of the variability in the dependent variable 
y that is explained by the estimated regression equation. 

Confidence interval The interval estimate of the mean value of y for a given value of x. 
Correlation coefficient A measure of the strength of the linear relationship between two 
variables (previously discussed in Chapter 3). 

Dependent variable The variable that is being predicted or explained. It is denoted by y. 
Estimated regression equation The estimate of the regression equation developed from
sample data by using the least squares method. For simple linear regression, the estimated
regression equation is ŷ = b₀ + b₁x.

High leverage points Observations with extreme values for the independent variables. 
Independent variable The variable that is doing the predicting or explaining. It is denoted 
by x. 

Influential observation An observation that has a strong influence or effect on the regres- 
sion results. 

ith residual The difference between the observed value of the dependent variable and
the value predicted using the estimated regression equation; for the ith observation the ith
residual is yᵢ − ŷᵢ.

Least squares method A procedure used to develop the estimated regression equation.
The objective is to minimize Σ(yᵢ − ŷᵢ)².

Mean square error The unbiased estimate of the variance of the error term σ². It is
denoted by MSE or s².

Normal probability plot A graph of the standardized residuals plotted against values of
the normal scores. This plot helps determine whether the assumption that the error term has
a normal probability distribution appears to be valid.

Outlier A data point or observation that does not fit the trend shown by the remaining data.
Prediction interval The interval estimate of an individual value of y for a given value of x. 
Regression equation The equation that describes how the mean or expected value of
the dependent variable is related to the independent variable; in simple linear regression,
E(y) = β₀ + β₁x.

Regression model The equation that describes how y is related to x and an error term; in
simple linear regression, the regression model is y = β₀ + β₁x + ε.

Residual analysis The analysis of the residuals used to determine whether the assump- 
tions made about the regression model appear to be valid. Residual analysis is also used to 
identify outliers and influential observations. 

Residual plot Graphical representation of the residuals that can be used to determine 
whether the assumptions made about the regression model appear to be valid. 

Scatter diagram A graph of bivariate data in which the independent variable is on the 
horizontal axis and the dependent variable is on the vertical axis. 

Simple linear regression Regression analysis involving one independent variable and one 
dependent variable in which the relationship between the variables is approximated by a 
straight line. 

Standard error of the estimate The square root of the mean square error, denoted by s. It
is the estimate of σ, the standard deviation of the error term ε.

Standardized residual The value obtained by dividing a residual by its standard deviation. 


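KEY FORMULAS

Simple Linear Regression Model
$$y = \beta_0 + \beta_1 x + \epsilon \tag{14.1}$$

Simple Linear Regression Equation
$$E(y) = \beta_0 + \beta_1 x \tag{14.2}$$

Estimated Simple Linear Regression Equation
$$\hat{y} = b_0 + b_1 x \tag{14.3}$$

Least Squares Criterion
$$\min \sum (y_i - \hat{y}_i)^2 \tag{14.5}$$

Slope and y-Intercept for the Estimated Regression Equation
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{14.6}$$
$$b_0 = \bar{y} - b_1 \bar{x} \tag{14.7}$$

Sum of Squares Due to Error
$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 \tag{14.8}$$

Total Sum of Squares
$$\text{SST} = \sum (y_i - \bar{y})^2 \tag{14.9}$$

Sum of Squares Due to Regression
$$\text{SSR} = \sum (\hat{y}_i - \bar{y})^2 \tag{14.10}$$

Relationship Among SST, SSR, and SSE
$$\text{SST} = \text{SSR} + \text{SSE} \tag{14.11}$$

Coefficient of Determination
$$r^2 = \frac{\text{SSR}}{\text{SST}} \tag{14.12}$$

Sample Correlation Coefficient
$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of determination}} = (\text{sign of } b_1)\sqrt{r^2} \tag{14.13}$$

Mean Square Error (Estimate of σ²)
$$s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2} \tag{14.15}$$

Standard Error of the Estimate
$$s = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n - 2}} \tag{14.16}$$

Standard Deviation of b₁
$$\sigma_{b_1} = \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}} \tag{14.17}$$

Estimated Standard Deviation of b₁
$$s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \tag{14.18}$$

t Test Statistic
$$t = \frac{b_1}{s_{b_1}} \tag{14.19}$$

Mean Square Regression
$$\text{MSR} = \frac{\text{SSR}}{\text{Number of independent variables}} \tag{14.20}$$

F Test Statistic
$$F = \frac{\text{MSR}}{\text{MSE}} \tag{14.21}$$

Estimated Standard Deviation of ŷ*
$$s_{\hat{y}^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{14.23}$$

Confidence Interval for E(y*)
$$\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*} \tag{14.24}$$

Estimated Standard Deviation of an Individual Value
$$s_{\text{pred}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{14.26}$$

Prediction Interval for y*
$$\hat{y}^* \pm t_{\alpha/2}\, s_{\text{pred}} \tag{14.27}$$

Residual for Observation i
$$y_i - \hat{y}_i \tag{14.28}$$

Standard Deviation of the ith Residual
$$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i} \tag{14.30}$$

Standardized Residual for Observation i
$$\frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}} \tag{14.32}$$

Leverage of Observation i
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \tag{14.33}$$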

SUPPLEMENTARY EXERCISES 


55. Stock Market Performance. The Dow Jones Industrial Average (DJIA) and the 
Standard & Poor’s 500 (S&P 500) indexes are used as measures of overall movement 
in the stock market. The DJIA is based on the price movements of 30 large companies; 
the S&P 500 is an index composed of 500 stocks. Some say the S&P 500 is a better 
measure of stock market performance because it is broader based. The closing price 
for the DJIA and the S&P 500 for 15 weeks, beginning with January 6, 2012, follow 
(Barron’s website). 


DATA file: DJIAS&P500

Date           DJIA      S&P
January 6      12,360    1278
January 13     12,422    1289
January 20     12,720    1315
January 27     12,660    1316
February 3     12,862    1345
February 10    12,801    1343
February 17    12,950    1362
February 24    12,983    1366
March 2        12,978    1370
March 9        12,922    1371
March 16       13,233    1404
March 23       13,081    1397
March 30       13,212    1408
April 5        13,060    1398
April 13       12,850    1370


a. Develop a scatter diagram with DJIA as the independent variable.
b. Develop the estimated regression equation.
c. Test for a significant relationship. Use $\alpha = .05$.
d. Did the estimated regression equation provide a good fit? Explain.
e. Suppose that the closing price for the DJIA is 13,500. Predict the closing price for
the S&P 500.
f. Should we be concerned that the DJIA value of 13,500 used to predict the S&P
500 value in part (e) is beyond the range of the data used to develop the estimated
regression equation?
56. Home Size and Price. Is the number of square feet of living space a good pre- 
dictor of a house’s selling price? The following data show the square footage 
and selling price for 15 houses in Winston Salem, North Carolina, in April 2015 
(Zillow.com). 


Size               Selling Price
(1000s sq. ft.)    ($1000s)
1.26               AES)
3.02               299.9
1)                 139.0
0.91               45.6
1.87               129
2.63               274.9
2.60               259.9
2.27               177.0
2.30               175.0
2.08               189.9
Wel                95.0
1.38               82.1
1.80               169.0
1.57               96.5
1.45               114.9

a. Develop a scatter diagram with square feet of living space as the independent 
variable and selling price as the dependent variable. What does the scatter diagram 
indicate about the relationship between the size of a house and the selling price? 

b. Develop the estimated regression equation that could be used to predict the selling 
price given the number of square feet of living space. 

c. At the .05 level, is there a significant relationship between the two variables? 

d. Use the estimated regression equation to predict the selling price of a 2000 square 
foot house in Winston Salem, North Carolina. 

e. Do you believe the estimated regression equation developed in part (b) will provide 
a good prediction of selling price of a particular house in Winston Salem, North 
Carolina? Explain. 

f. Would you be comfortable using the estimated regression equation developed in 
part (b) to predict the selling price of a particular house in Seattle, Washington? 
Why or why not? 

57. Online Education. One of the biggest changes in higher education in recent years has 
been the growth of online universities. The Online Education Database is an indepen- 
dent organization whose mission is to build a comprehensive list of the top accredited 
online colleges. The following table shows the retention rate (%) and the graduation 
rate (%) for 29 online colleges. 

DATA file: OnlineEdu

Retention    Graduation
Rate (%)     Rate (%)
7            25
51           25
4            28
29           22
33           33
47           33
63           34
45           36
60           36
62           36
67           36
65           37
78           37
75           38
54           39
45           41
38           44
51           45
69           46
60           47
37           48
63           50
73           51
78           52
48           53
OS           55
68           56
100          D
100          61


a. Develop a scatter diagram with retention rate as the independent variable. What 
does the scatter diagram indicate about the relationship between the two variables? 

b. Develop the estimated regression equation. 

c. Test for a significant relationship. Use $\alpha = .05$.

d. Did the estimated regression equation provide a good fit? 

58. Machine Maintenance. Jensen Tire & Auto is in the process of deciding whether to 
purchase a maintenance contract for its new computer wheel alignment and balancing 
machine. Managers feel that maintenance expense should be related to usage, and they 
collected the following information on weekly usage (hours) and annual maintenance 
expense (in hundreds of dollars). 


DATA file: JensenTires

Weekly Usage    Annual
(hours)         Maintenance Expense
13              17.0
7               Ya
20              30.0
28              37.0
32              47.0
17              30.5
24              32.5
31              39.0
40              51.5
38              40.0


a. Develop the estimated regression equation that relates annual maintenance expense 
to weekly usage. 

b. Test the significance of the relationship in part (a) at a .05 level of significance. 

c. Jensen expects to use the new machine 30 hours per week. Develop a 95% predic- 
tion interval for the company’s annual maintenance expense. 
d. If the maintenance contract costs $3000 per year, would you recommend purchasing
it? Why or why not?

59. Bus Maintenance. The regional transit authority for a major metropolitan area wants
to determine whether there is any relationship between the age of a bus and the annual
maintenance cost. A sample of 10 buses resulted in the following data.

DATA file: AgeCost

Age of Bus (years)    Maintenance Cost ($)
                      350
                      370
                      480
                      520
                      590
                      550
                      750
                      800
                      790
                      950

aanaBtbBWNYNONND =>

a. Develop the least squares estimated regression equation.

b. Test to see whether the two variables are significantly related with $\alpha = .05$.

c. Did the least squares line provide a good fit to the observed data? Explain.

d. Develop a 95% prediction interval for the maintenance cost for a specific bus that is
4 years old.

60. Studying and Grades. A marketing professor at Givens College is interested in the
relationship between hours spent studying and total points earned in a course. Data
collected on 10 students who took the course last quarter follow.

DATA file: HoursPts

Hours             Total
Spent Studying    Points Earned
45                40
30                35
90                US)
60                65
105               90
65                50
90                90
80                80
55                45
Ts                65

a. Develop an estimated regression equation showing how total points earned is related
to hours spent studying.

b. Test the significance of the model with $\alpha = .05$.

c. Predict the total points earned by Mark Sweeney. He spent 95 hours studying.

d. Develop a 95% prediction interval for the total points earned by Mark Sweeney.

61. Used Car Mileage and Price. The Toyota Camry is one of the best-selling cars in
North America. The cost of a previously owned Camry depends upon many factors,
including the model year, mileage, and condition. To investigate the relationship between
the car's mileage and the sales price for a 2007 model year Camry, the following
data show the mileage and sale price for 19 sales (PriceHub website).
DATA file: Camry

Miles (1000s)    Price ($1000s)
22               16.2
29               16.0
36               13.8
47               Wile
-                ne
Vi               129)
73               le
87               IGG
92               ines
101              10.8
110              8.3
28               12.5
59               Wile!
68               15.0
68               12.2
oA               13.0
42               15.6
65               12.7
110              8.3

a. Develop a scatter diagram with the car mileage on the horizontal axis and the price 
on the vertical axis. 

b. What does the scatter diagram developed in part (a) indicate about the relationship 
between the two variables? 

c. Develop the estimated regression equation that could be used to predict the price 
($1000s) given the miles (1000s). 

d. Test for a significant relationship at the .05 level of significance. 

e. Did the estimated regression equation provide a good fit? Explain. 

f. Provide an interpretation for the slope of the estimated regression equation. 

g. Suppose that you are considering purchasing a previously owned 2007 Camry that 
has been driven 60,000 miles. Using the estimated regression equation developed in 
part (c), predict the price for this car. Is this the price you would offer the seller? 

CASE PROBLEM 1: MEASURING STOCK
MARKET RISK

One measure of the risk or volatility of an individual stock is the standard deviation of the
total return (capital appreciation plus dividends) over several periods of time. Although
the standard deviation is easy to compute, it does not take into account the extent to which
the price of a given stock varies as a function of a standard market index, such as the S&P
500. As a result, many financial analysts prefer to use another measure of risk referred to
as beta.

Betas for individual stocks are determined by simple linear regression. The dependent
variable is the total return for the stock and the independent variable is the total return for
the stock market.* For this case problem we will use the S&P 500 index as the measure
of the total return for the stock market, and an estimated regression equation will be
developed using monthly data. The beta for the stock is the slope of the estimated regression
equation ($b_1$). The data contained in the file Beta provide the total return (capital
appreciation plus dividends) over 36 months for eight widely traded common stocks and
the S&P 500.

DATA file: Beta

*Various sources use different approaches for computing betas. For instance, some sources subtract the return that
could be obtained from a risk-free investment (e.g., T-bills) from the dependent variable and the independent variable
before computing the estimated regression equation. Some also use different indexes for the total return of the
stock market; for instance, Value Line computes betas using the New York Stock Exchange composite index.

The value of beta for the stock market will always be 1; thus, stocks that tend to rise and
fall with the stock market will also have a beta close to 1. Betas greater than 1 indicate that
the stock is more volatile than the market, and betas less than 1 indicate that the stock is
less volatile than the market. For instance, if a stock has a beta of 1.4, it is 40% more volatile
than the market, and if a stock has a beta of .4, it is 60% less volatile than the market.
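To make the definition concrete, the sketch below computes a beta as the least squares slope from regressing stock returns on market returns. It is only an illustration: the function name `beta` and the six monthly returns are hypothetical and are not taken from the file Beta.

```python
def beta(stock_returns, market_returns):
    """Slope b1 from regressing stock returns on market returns."""
    n = len(market_returns)
    m_bar = sum(market_returns) / n
    s_bar = sum(stock_returns) / n
    sxy = sum((m - m_bar) * (s - s_bar)
              for m, s in zip(market_returns, stock_returns))
    sxx = sum((m - m_bar) ** 2 for m in market_returns)
    return sxy / sxx

# Hypothetical monthly total returns for the market index.
market = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]

# A made-up stock whose return moves 1.4 times as much as the market's.
stock = [0.001 + 1.4 * m for m in market]

print(beta(market, market))  # the market regressed on itself
print(beta(stock, market))   # the made-up stock
```

Regressing the market on itself gives a slope of 1, which is why the beta of the market is always 1. The made-up stock was constructed to move 1.4 times as much as the market, so its slope is 1.4 up to floating-point rounding.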


Managerial Report 
You have been assigned to analyze the risk characteristics of these stocks. Prepare a report 
that includes but is not limited to the following items. 


a. Compute descriptive statistics for each stock and the S&P 500. Comment on your 
results. Which stocks are the most volatile? 

b. Compute the value of beta for each stock. Which of these stocks would you expect to 
perform best in an up market? Which would you expect to hold their value best in a 
down market? 

c. Comment on how much of the return for the individual stocks is explained by the market. 


CASE PROBLEM 2: U.S. DEPARTMENT OF 
TRANSPORTATION 


As part of a study on transportation safety, the U.S. Department of Transportation collec- 
ted data on the number of fatal accidents per 1000 licenses and the percentage of licensed 
drivers under the age of 21 in a sample of 42 cities. Data collected over a one-year period 
follow. These data are contained in the file Safety. 


DATA file: Safety

Percent     Fatal Accidents       Percent     Fatal Accidents
Under 21    per 1000 Licenses     Under 21    per 1000 Licenses
13          2.962                 17          4.100
12          0.708                 8           2.190
8           0.885                 16          3.623
12          1.652                 Alls)       2.623
14          2.091                 G           0.835
17          21027.                8           0.820
18          3.830                 14          2.890
8           0.368                 8           1.267
183         1.142                 15          3.224
8           0.645                 10          1.014
9           1.028                 10          0.493
16          2.801                 14          1.443
12          1.405                 18          3.614
©)          1.433                 10          1.926
10          0.039                 14          1.643
9           0.338                 16          2.943
11          1.849                 12          1.913
12          2.246                 15          2.814
14          2.255                 13          2.634
14          2.952                 G           0.926
ql          1.294                 Ad          3.256
Managerial Report


1. Develop numerical and graphical summaries of the data. 

2. Use regression analysis to investigate the relationship between the number of fatal acci- 
dents and the percentage of drivers under the age of 21. Discuss your findings. 

3. What conclusion and recommendations can you derive from your analysis? 


CASE PROBLEM 3: SELECTING A POINT-AND- 
SHOOT DIGITAL CAMERA 




Consumer Reports tested 166 different point-and-shoot digital cameras. Based upon fac- 

tors such as the number of megapixels, weight (oz), image quality, and ease of use, they 
developed an overall score for each camera tested. The overall score ranges from 0 to 100, 
with higher scores indicating better overall test results. Selecting a camera with many options 
can be a difficult process, and price is certainly a key issue for most consumers. By spend- 
ing more, will a consumer really get a superior camera? And, do cameras that have more 
megapixels, a factor often considered to be a good measure of picture quality, cost more than 
cameras with fewer megapixels? Table 14.15 shows the brand, average retail price ($), num- 
ber of megapixels, weight (oz), and the overall score for 13 Canon and 15 Nikon subcompact 
cameras tested by Consumer Reports (Consumer Reports website). 


TABLE 14.15 Data for 28 Point-and-Shoot Digital Cameras 


DATA file: Cameras

Observation    Brand    Price ($)    Megapixels    Weight (oz)    Score
1              Canon    330          10            7              66
2              Canon    200          12            5              66
3              Canon    300          12            7              65
4              Canon    200          10            6              62
5              Canon    180          12            5              62
6              Canon    200          12            7              61
7              Canon    200          14            5              60
8              Canon    130          10            7              60
9              Canon    130          12            5              59
10             Canon    110          16            5              55
11             Canon    90           14            5              52
12             Canon    100          10            6              51
13             Canon    90           12            7              46
14             Nikon    270          16            5              65
15             Nikon    300          16            7              63
16             Nikon    200          14            6              61
17             Nikon    400          14            7              59
18             Nikon    120          14            5              57
19             Nikon    170          16            6              56
20             Nikon    150          12            5              56
21             Nikon    230          14            6              55
22             Nikon    180          12            6              53
23             Nikon    130          12            6              53
24             Nikon    80           12            7              52
25             Nikon    80           14            7              50
26             Nikon    100          12            4              46
27             Nikon    110          12            5              45
28             Nikon    130          14            4              42
Managerial Report


1. Develop numerical summaries of the data. 

2. Using overall score as the dependent variable, develop three scatter diagrams, one using 
price as the independent variable, one using the number of megapixels as the indepen- 
dent variable, and one using weight as the independent variable. Which of the three 
independent variables appears to be the best predictor of overall score? 

3. Using simple linear regression, develop an estimated regression equation that could be 
used to predict the overall score given the price of the camera. 

4. Analyze the data using only the observations for the Canon cameras. Discuss the appro- 
priateness of using simple linear regression and make any recommendations regarding 
the prediction of overall score using just the price of the camera. 
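Although the text carries out this analysis in Excel, items 3 and 4 can be sketched in Python as a starting point. The price and score values below are copied from Table 14.15; the helper name `fit_line` is an assumption made for this illustration, and the sketch stops short of the discussion the report asks for.

```python
def fit_line(x, y):
    """Return (b0, b1, r2) for the least squares line y_hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst  # equals SSR/SST because SST = SSR + SSE
    return b0, b1, r2

# Price ($) and overall score for observations 1-28 of Table 14.15.
price = [330, 200, 300, 200, 180, 200, 200, 130, 130, 110, 90, 100, 90,
         270, 300, 200, 400, 120, 170, 150, 230, 180, 130, 80, 80, 100,
         110, 130]
score = [66, 66, 65, 62, 62, 61, 60, 60, 59, 55, 52, 51, 46,
         65, 63, 61, 59, 57, 56, 56, 55, 53, 53, 52, 50, 46,
         45, 42]

# Item 3: all 28 cameras.
b0, b1, r2 = fit_line(price, score)
print("All cameras: b0 =", b0, "b1 =", b1, "r2 =", r2)

# Item 4: the 13 Canon cameras are observations 1-13.
b0_canon, b1_canon, r2_canon = fit_line(price[:13], score[:13])
print("Canon only:  b0 =", b0_canon, "b1 =", b1_canon, "r2 =", r2_canon)
```

Comparing the two values of r2 is one way to begin the discussion requested in item 4. A scatter diagram and residual plot are still needed to judge whether a straight line is appropriate.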


CASE PROBLEM 4: FINDING THE BEST 
CAR VALUE 


When trying to decide what car to buy, real value is not necessarily determined by how 
much you spend on the initial purchase. Instead, cars that are reliable and don’t cost much 
to own often represent the best values. But, no matter how reliable or inexpensive a car 
may cost to own, it must also perform well. 

To measure value, Consumer Reports developed a statistic referred to as a value score. 
The value score is based upon five-year owner costs, overall road-test scores, and predicted 
reliability ratings. Five-year owner costs are based on the expenses incurred in the first five 
years of ownership, including depreciation, fuel, maintenance and repairs, and so on. Using 
a national average of 12,000 miles per year, an average cost per mile driven is used as the 
measure of five-year owner costs. Road-test scores are the results of more than 50 tests and 
evaluations and are based upon a 100-point scale, with higher scores indicating better perfor- 
mance, comfort, convenience, and fuel economy. The highest road-test score obtained in the 
tests conducted by Consumer Reports was a 99 for a Lexus LS 460L. Predicted-reliability 
ratings (1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, and 5 = Excellent) are based on 
data from Consumer Reports’ Annual Auto Survey. 

A car with a value score of 1.0 is considered to be “average-value.” A car with a value 
score of 2.0 is considered to be twice as good a value as a car with a value score of 1.0; a 
car with a value score of 0.5 is considered half as good as average; and so on. The data for 
20 family sedans, including the price ($) of each car tested, follow. 


DATA file: FamilySedans

                                                            Road-Test    Predicted      Value
Car                               Price ($)    Cost/Mile    Score        Reliability    Score
Nissan Altima 2.5 S (4-cyl.)      23,970       0.59         91           4              1.75
Kia Optima LX (2.4)               21,885       0.58         81           4              1578
Subaru Legacy 2.5i Premium        23,830       0.59         83           4              les
Ford Fusion Hybrid                32,360       0.63         84           5              1.70
Honda Accord LX-P (4-cyl.)        23,730       0.56         80           4              1.62
Mazda6 i Sport (4-cyl.)           22,035       0.58         73           4              1.60
Hyundai Sonata GLS (2.4)          21,800       0.56         89           3              1.58
Ford Fusion SE (4-cyl.)           23,625       0.57         76           4              1.55
Chevrolet Malibu LT (4-cyl.)      xl ALAS)     0.57         74           3              1.48
Kia Optima SX (2.0T)              29,050       0.72         84           4              1.43
Ford Fusion SEL (V6)              28,400       0.67         80           4              1.42
Nissan Altima 3.5 SR (V6)         30,335       0.69         93           4              1.42
Hyundai Sonata Limited (2.0T)     28,090       0.66         89           3              1.39
Honda Accord EX-L (V6)            28,695       0.67         90           3              1.36
Mazda6 s Grand Touring (V6)       30,790       0.74         81           4              1.34
Ford Fusion SEL (V6, AWD)         30,055       0.71         75           4              1.32
Subaru Legacy 3.6R Limited        30,094       0.71         88           3              1.29
Chevrolet Malibu LTZ (V6)         28,045       0.67         83           3              1.20
Chrysler 200 Limited (V6)         27,825       0.70         52           5              1.20
Chevrolet Impala LT (3.6)         28,995       0.67         63           3              1.05


Managerial Report 


1. Develop numerical summaries of the data. 

2. Use regression analysis to develop an estimated regression equation that could be used 

to predict the value score given the price of the car. 

3. Use regression analysis to develop an estimated regression equation that could be used 

to predict the value score given the five-year owner costs (cost/mile). 

4. Use regression analysis to develop an estimated regression equation that could be used 

to predict the value score given the road-test score. 

5. Use regression analysis to develop an estimated regression equation that could be used 
to predict the value score given the predicted-reliability. 

6. What conclusions can you derive from your analysis? 


CASE PROBLEM 5: BUCKEYE CREEK 
AMUSEMENT PARK 


Buckeye Creek Amusement Park is open from the beginning of May to the end of October. 
Buckeye Creek relies heavily on the sale of season passes. The sale of season passes brings 
in significant revenue prior to the park opening each season, and season pass holders con- 
tribute a substantial portion of the food, beverage, and novelty sales in the park. Greg Ross, 
director of marketing at Buckeye Creek, has been asked to develop a targeted marketing 
campaign to increase season pass sales. 
Greg has data for last season that show the number of season pass holders for each
zip code within 50 miles of Buckeye Creek. He has also obtained the total population of
each zip code from the U.S. Census Bureau website. Greg thinks it may be possible to use
regression analysis to predict the number of season pass holders in a zip code given the total
population of a zip code. If this is possible, he could then conduct a direct mail campaign
that would target zip codes that have fewer than the expected number of season pass holders.

DATA file: BuckeyeCreek


Managerial Report 


1. Compute descriptive statistics and construct a scatter diagram for the data. Discuss your
findings.

2. Using simple linear regression, develop an estimated regression equation that could be
used to predict the number of season pass holders in a zip code given the total population
of the zip code.

3. Test for a significant relationship at the .05 level of significance.

4. Did the estimated regression equation provide a good fit?

5. Use residual analysis to determine whether the assumed regression model is appropriate.

6. Discuss if/how the estimated regression equation should be used to guide the marketing
campaign.

7. What other data might be useful to predict the number of season pass holders in a zip code?
Chapter 14 Appendix


Appendix 14.1: Calculus-Based Derivation of Least 
Squares Formulas 


As mentioned in the chapter, the least squares method is a procedure for determining
the values of $b_0$ and $b_1$ that minimize the sum of squared residuals. The sum of squared
residuals is given by

$\sum (y_i - \hat{y}_i)^2$

Substituting $\hat{y}_i = b_0 + b_1 x_i$, we get

$\sum (y_i - b_0 - b_1 x_i)^2$ (14.34)

as the expression that must be minimized.

To minimize expression (14.34), we must take the partial derivatives with respect to $b_0$
and $b_1$, set them equal to zero, and solve. Doing so, we get

$\frac{\partial \sum (y_i - b_0 - b_1 x_i)^2}{\partial b_0} = -2 \sum (y_i - b_0 - b_1 x_i) = 0$ (14.35)

$\frac{\partial \sum (y_i - b_0 - b_1 x_i)^2}{\partial b_1} = -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0$ (14.36)

Dividing equation (14.35) by two and summing each term individually yields

$-\sum y_i + \sum b_0 + \sum b_1 x_i = 0$

Bringing $\sum y_i$ to the other side of the equal sign and noting that $\sum b_0 = n b_0$, we obtain

$n b_0 + (\sum x_i) b_1 = \sum y_i$ (14.37)

Similar algebraic simplification applied to equation (14.36) yields

$(\sum x_i) b_0 + (\sum x_i^2) b_1 = \sum x_i y_i$ (14.38)

Equations (14.37) and (14.38) are known as the normal equations. Solving equation
(14.37) for $b_0$ yields

$b_0 = \frac{\sum y_i}{n} - b_1 \frac{\sum x_i}{n}$ (14.39)

Using equation (14.39) to substitute for $b_0$ in equation (14.38) provides

$\frac{\sum x_i \sum y_i}{n} - \frac{(\sum x_i)^2}{n} b_1 + (\sum x_i^2) b_1 = \sum x_i y_i$ (14.40)

By rearranging the terms in equation (14.40), we obtain

$b_1 = \frac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ (14.41)

Because $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$, we can rewrite equation (14.39) as

$b_0 = \bar{y} - b_1 \bar{x}$ (14.42)

Equations (14.41) and (14.42) are the formulas (14.6) and (14.7) we used in the chapter to
compute the coefficients in the estimated regression equation.
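As a numerical check on this derivation, the short Python sketch below computes $b_1$ from the sums form of equation (14.41) and $b_0$ from equation (14.39), then confirms that the results satisfy both normal equations. The function name and the five data points are hypothetical, chosen only to keep the arithmetic simple.

```python
def least_squares_from_sums(x, y):
    """Compute b0 and b1 from the sums that appear in the normal equations."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    # Equation (14.41), first form.
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Equation (14.39).
    b0 = sum_y / n - b1 * sum_x / n

    # Left side minus right side of the normal equations (14.37) and (14.38);
    # both differences should be zero apart from rounding error.
    check_1437 = n * b0 + sum_x * b1 - sum_y
    check_1438 = sum_x * b0 + sum_x2 * b1 - sum_xy
    return b0, b1, check_1437, check_1438

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, check_1437, check_1438 = least_squares_from_sums(x, y)
print("b0 =", b0, "b1 =", b1)
print("normal equation residuals:", check_1437, check_1438)
```

Here the sums are $\sum x_i = 15$, $\sum y_i = 20$, $\sum x_i y_i = 66$, and $\sum x_i^2 = 55$, so $b_1 = (66 - 60)/(55 - 45) = 0.6$ and $b_0 = 4 - 0.6(3) = 2.2$.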
Appendix 14.2: A Test for Significance Using Correlation


Using the sample correlation coefficient $r_{xy}$, we can determine whether the linear relationship
between x and y is significant by testing the following hypotheses about the population
correlation coefficient $\rho_{xy}$.

$H_0: \rho_{xy} = 0$
$H_a: \rho_{xy} \neq 0$

If $H_0$ is rejected, we can conclude that the population correlation coefficient is not equal to
zero and that the linear relationship between the two variables is significant. This test for
significance follows.


A TEST FOR SIGNIFICANCE USING CORRELATION

$H_0: \rho_{xy} = 0$
$H_a: \rho_{xy} \neq 0$

TEST STATISTIC
$t = r_{xy} \sqrt{\frac{n - 2}{1 - r_{xy}^2}}$

REJECTION RULE
p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or if $t \ge t_{\alpha/2}$

where $t_{\alpha/2}$ is based on a t distribution with $n - 2$ degrees of freedom.


In Section 14.3, we found that the sample with $n = 10$ provided the sample correlation
coefficient for student population and quarterly sales of $r_{xy} = .9501$. The test statistic is

$t = r_{xy} \sqrt{\frac{n - 2}{1 - r_{xy}^2}} = .9501 \sqrt{\frac{10 - 2}{1 - (.9501)^2}} = 8.61$

The t distribution table shows that with $n - 2 = 10 - 2 = 8$ degrees of freedom, $t = 3.355$
provides an area of .005 in the upper tail. Thus, the area in the upper tail of the t distribution
corresponding to the test statistic $t = 8.61$ must be less than .005. Because this test is a two-tailed
test, we double this value to conclude that the p-value associated with $t = 8.61$ must be less
than 2(.005) = .01. Using Excel, the p-value = .000. Because the p-value is less than $\alpha = .01$,
we reject $H_0$ and conclude that $\rho_{xy}$ is not equal to zero. This evidence is sufficient to conclude
that a significant linear relationship exists between student population and quarterly sales.
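The arithmetic in this example is easy to reproduce. The Python sketch below uses only the values stated above ($r_{xy} = .9501$, $n = 10$, and the table value $t = 3.355$ for 8 degrees of freedom); the function name `correlation_t_statistic` is an assumption made for this illustration.

```python
import math

def correlation_t_statistic(r_xy, n):
    """Test statistic t = r_xy * sqrt((n - 2) / (1 - r_xy**2))."""
    return r_xy * math.sqrt((n - 2) / (1 - r_xy ** 2))

r_xy = 0.9501     # sample correlation coefficient from Section 14.3
n = 10            # sample size
t_critical = 3.355  # t table value with 8 degrees of freedom, upper-tail area .005

t = correlation_t_statistic(r_xy, n)
print(round(t, 2))  # 8.61

# Critical value approach for the two-tailed test at alpha = .01.
reject_h0 = t <= -t_critical or t >= t_critical
print("Reject H0:", reject_h0)
```

Because 8.61 exceeds 3.355, the critical value approach rejects $H_0$ at $\alpha = .01$, which matches the p-value conclusion above.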

Note that except for rounding, the test statistic t and the conclusion of a significant
relationship are identical to the results obtained in Section 14.5 for the t test conducted
using Armand's estimated regression equation $\hat{y} = 60 + 5x$. Performing regression analysis
provides the conclusion of a significant relationship between x and y and in addition
provides the equation showing how the variables are related. Most analysts therefore use
modern computer packages to perform regression analysis and find that using correlation as
a test of significance is unnecessary.
Chapter 15 Multiple Regression


STATISTICS IN PRACTICE 


International Paper* 
PURCHASE, NEW YORK 


International Paper is the world’s largest paper and 
forest products company. The company employs 
more than 117,000 people in its operations in nearly 
50 countries, and exports its products to more than 
130 nations. International Paper produces building 
materials such as lumber and plywood; consumer 
packaging materials such as disposable cups and con- 
tainers; industrial packaging materials such as corru- 
gated boxes and shipping containers; and a variety of 
papers for use in photocopiers, printers, books, and 
advertising materials. 

To make paper products, pulp mills process wood 
chips and chemicals to produce wood pulp. The wood 
pulp is then used at a paper mill to produce paper 
products. In the production of white paper products, 
the pulp must be bleached to remove any discolor- 
ation. A key bleaching agent used in the process is 
chlorine dioxide, which, because of its combustible 
nature, is usually produced at a pulp mill facility and 
then piped in solution form into the bleaching tower 
of the pulp mill. To improve one of the processes used 
to produce chlorine dioxide, researchers studied the 
process's control and efficiency. One aspect of the 
study looked at the chemical feed rate for chlorine 
dioxide production. 

To produce the chlorine dioxide, four chemicals flow
at metered rates into the chlorine dioxide generator.
The chlorine dioxide produced in the generator flows
to an absorber where chilled water absorbs the chlorine
dioxide gas to form a chlorine dioxide solution. The
solution is then piped into the paper mill. A key part of
controlling the process involves the chemical feed rates.
Historically, experienced operators set the chemical
feed rates, but this approach led to overcontrol by the
operators. Consequently, chemical engineers at the mill
requested that a set of control equations, one for each
chemical feed, be developed to aid the operators in
setting the rates.

*The authors are indebted to Marian Williams and Bill Griggs for
providing this Statistics in Practice. This application was originally
developed at Champion International Corporation, which became part
of International Paper in 2000.

[Photo: Multiple regression analysis assisted in the development
of a better bleaching process for making white paper
products. RGB Ventures/SuperStock/Alamy Stock Photo]

Using multiple regression analysis, statistical analysts 
developed an estimated multiple regression equation 
for each of the four chemicals used in the process. Each 
equation related the production of chlorine dioxide to 
the amount of chemical used and the concentration 
level of the chlorine dioxide solution. The resulting set 
of four equations was programmed into a computer 
at each mill. In the new system, operators enter the 
concentration of the chlorine dioxide solution and the 
desired production rate; the computer software then 
calculates the chemical feed needed to achieve the 
desired production rate. After the operators began 
using the control equations, the chlorine dioxide 
generator efficiency increased, and the number of 
times the concentrations fell within acceptable ranges 
increased significantly. 

This example shows how multiple regression analysis 
can be used to develop a better bleaching process for 
producing white paper products. In this chapter, we will 
show how Excel can be used for such purposes. Most of 
the concepts introduced in Chapter 14 for simple linear 
regression can be directly extended to the multiple 
regression case. 


In Chapter 14 we presented simple linear regression and demonstrated its use in developing
an estimated regression equation that describes the relationship between two variables.
Recall that the variable being predicted or explained is called the dependent variable
and the variable being used to predict or explain the dependent variable is called the
independent variable. In this chapter we continue our study of regression analysis by
considering situations involving two or more independent variables. This subject area, called
multiple regression analysis, enables us to consider more factors and thus obtain better
predictions than are possible with simple linear regression.


15.1 Multiple Regression Model 


Multiple regression analysis is the study of how a dependent variable y is related to two 
or more independent variables. In the general case, we will use p to denote the number of 
independent variables. 


Regression Model and Regression Equation 


The concepts of a regression model and a regression equation introduced in the preceding
chapter are applicable in the multiple regression case. The equation that describes how the
dependent variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_p$ and an error term
is called the multiple regression model. We begin with the assumption that the multiple
regression model takes the following form.


MULTIPLE REGRESSION MODEL
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ (15.1)


In the multiple regression model, $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the parameters and the error term
$\epsilon$ (the Greek letter epsilon) is a random variable. A close examination of this model reveals
that $y$ is a linear function of $x_1, x_2, \ldots, x_p$ (the $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ part) plus
the error term $\epsilon$. The error term accounts for the variability in $y$ that cannot be explained by
the linear effect of the $p$ independent variables.

In Section 15.4 we will discuss the assumptions for the multiple regression model and $\epsilon$.
One of the assumptions is that the mean or expected value of $\epsilon$ is zero. A consequence
of this assumption is that the mean or expected value of $y$, denoted $E(y)$, is equal to
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The equation that describes how the mean value of $y$ is
related to $x_1, x_2, \ldots, x_p$ is called the multiple regression equation.


MULTIPLE REGRESSION EQUATION
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ (15.2)


Estimated Multiple Regression Equation 


If the values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ were known, equation (15.2) could be used to compute
the mean value of $y$ at given values of $x_1, x_2, \ldots, x_p$. Unfortunately, these parameter values
will not, in general, be known and must be estimated from sample data. A simple random
sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_p$ that are used as the point
estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$. These sample statistics provide the following
estimated multiple regression equation.


ESTIMATED MULTIPLE REGRESSION EQUATION
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$ (15.3)
where
$b_0, b_1, b_2, \ldots, b_p$ are the estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$
$\hat{y}$ = predicted value of the dependent variable
FIGURE 15.1 The Estimation Process for Multiple Regression

[Figure: flow diagram with four linked boxes.
(1) Multiple regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ and
multiple regression equation $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$, where
$\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are unknown parameters.
(2) Sample data.
(3) Compute the estimated multiple regression equation
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$, where $b_0, b_1, b_2, \ldots, b_p$ are sample statistics.
(4) $b_0, b_1, b_2, \ldots, b_p$ provide the estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.]

Margin note: In simple linear regression, $b_0$ and $b_1$ were the sample
statistics used to estimate the parameters $\beta_0$ and $\beta_1$. Multiple regression
parallels this statistical inference process, with $b_0, b_1, b_2, \ldots, b_p$ denoting the
sample statistics used to estimate the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.

The estimation process for multiple regression is shown in Figure 15.1. 


15.2 Least Squares Method 


In Chapter 14, we used the least squares method to develop the estimated regression 
equation that best approximated the straight-line relationship between the dependent and 
independent variables. This same approach is used to develop the estimated multiple re- 
gression equation. The least squares criterion is restated as follows. 


LEAST SQUARES CRITERION 
$$\min \sum (y_i - \hat{y}_i)^2 \tag{15.4}$$
where
$y_i$ = observed value of the dependent variable for the ith observation
$\hat{y}_i$ = predicted value of the dependent variable for the ith observation


The predicted values of the dependent variable are computed by using the estimated
multiple regression equation,


$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$


As expression (15.4) shows, the least squares method uses sample data to provide the
values of $b_0, b_1, b_2, \ldots, b_p$ that make the sum of squared residuals (the deviations between
the observed values of the dependent variable ($y_i$) and the predicted values of the dependent
variable ($\hat{y}_i$)) a minimum.

In Chapter 14 we presented formulas for computing the least squares estimators $b_0$ and
$b_1$ for the estimated simple linear regression equation $\hat{y} = b_0 + b_1 x$. With relatively small
data sets, we were able to use those formulas to compute $b_0$ and $b_1$ by manual calculations.
In multiple regression, however, the presentation of the formulas for the regression
coefficients $b_0, b_1, b_2, \ldots, b_p$ involves the use of matrix algebra and is beyond the scope of this
text. Therefore, in presenting multiple regression, we focus on how computer software
packages can be used to obtain the estimated regression equation and other information.
The emphasis will be on how to interpret the computer output rather than on how to make
the multiple regression computations.


An Example: Butler Trucking Company 


As an illustration of multiple regression analysis, we will consider a problem faced by the 
Butler Trucking Company, an independent trucking company in southern California. A 
major portion of Butler’s business involves deliveries throughout its local area. To develop 
better work schedules, the managers want to predict the total daily travel time for their 
drivers. 

Initially the managers believed that the total daily travel time would be closely related
to the number of miles traveled in making the daily deliveries. A simple random sample
of 10 driving assignments provided the data shown in Table 15.1 and the scatter diagram
shown in Figure 15.2. After reviewing this scatter diagram, the managers hypothesized
that the simple linear regression model $y = \beta_0 + \beta_1 x_1 + \epsilon$ could be used to describe the
relationship between the total travel time (y) and the number of miles traveled ($x_1$). To
estimate the parameters $\beta_0$ and $\beta_1$, the least squares method was used to develop the estimated
regression equation.


$$\hat{y} = b_0 + b_1 x_1 \tag{15.5}$$


In Figure 15.3, we show the Excel Regression tool output¹ from applying simple linear
regression to the data in Table 15.1. The estimated regression equation is


$$\hat{y} = 1.2739 + .0678x_1$$
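As a cross-check outside Excel, the short Python sketch below applies the Chapter 14 least squares formulas to the 10 observations of Table 15.1. The function name `fit_simple` is ours, not part of any library, and the sketch is meant only to illustrate where the two coefficients come from.

```python
# Table 15.1 data: miles traveled (x1) and travel time in hours (y).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def fit_simple(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross products and sum of squares of deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple(miles, hours)
print(f"y_hat = {b0:.4f} + {b1:.4f} x1")
```

Rounded to four decimal places, the coefficients agree with the Excel output quoted above.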


At the .05 level of significance, the F value of 15.8146 and its corresponding p-value of
.0041 indicate that the relationship is significant; that is, we can reject $H_0$: $\beta_1 = 0$ because
the p-value is less than $\alpha = .05$. Note that the same conclusion is obtained from the t value
of 3.9768 and its associated p-value of .0041. Thus, we can conclude that the relationship
between the total travel time and the number of miles traveled is significant; longer travel
times are associated with more miles traveled. With a coefficient of determination of


TABLE 15.1 Preliminary Data for Butler Trucking

DATA file: Butler

Driving       $x_1$ = Miles   y = Travel Time
Assignment    Traveled        (hours)
 1            100             9.3
 2             50             4.8
 3            100             8.9
 4            100             6.5
 5             50             4.2
 6             80             6.2
 7             75             7.4
 8             65             6.0
 9             90             7.6
10             90             6.1

¹Excel’s Regression tool was used to obtain the output. Section 14.7 describes how to use Excel’s Regression tool
for simple linear regression.




FIGURE 15.2 Scatter Diagram of Preliminary Data for Butler Trucking


FIGURE 15.3 Regression Tool Output for Butler Trucking with One Independent Variable

[Figure: Excel worksheet containing the Table 15.1 data in columns labeled Miles and Time, followed by the Regression tool output.]
TABLE 15.2 Data for Butler Trucking with Miles Traveled ($x_1$) and Number
of Deliveries ($x_2$) as the Independent Variables

DATA file: Butler

Driving       $x_1$ = Miles   $x_2$ = Number    y = Travel Time
Assignment    Traveled        of Deliveries     (hours)
 1            100             4                 9.3
 2             50             3                 4.8
 3            100             4                 8.9
 4            100             2                 6.5
 5             50             2                 4.2
 6             80             2                 6.2
 7             75             3                 7.4
 8             65             4                 6.0
 9             90             3                 7.6
10             90             2                 6.1



R Square = .6641, we see that 66.41% of the variability in travel time can be explained by 
the linear effect of the number of miles traveled. This finding is fairly good, but the man- 
agers might want to consider adding a second independent variable to explain some of the 
remaining variability in the dependent variable. 
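The R Square, F, and t values quoted for the one-variable model can be reproduced from the sums of squares that Excel reports for it (SSR = 15.8713 and SSE = 8.0287, the values cited in Section 15.3). The sketch below is only an arithmetic check using the Chapter 14 definitions; the variable names are ours.

```python
# Sums of squares reported by Excel for the model with miles traveled only.
n = 10            # number of observations
p = 1             # number of independent variables
ssr = 15.8713     # sum of squares due to regression
sse = 8.0287      # sum of squares due to error
sst = ssr + sse   # total sum of squares

r_square = ssr / sst          # coefficient of determination
msr = ssr / p                 # mean square due to regression
mse = sse / (n - p - 1)       # mean square due to error
f_value = msr / mse           # F test statistic
t_value = f_value ** 0.5      # with one independent variable, t squared equals F
                              # (positive root because b1 is positive)

print(round(sst, 1), round(r_square, 4), round(f_value, 4), round(t_value, 4))
```

The results match the reported values of 23.9, .6641, 15.8146, and 3.9768 after rounding.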

In attempting to identify another independent variable, the managers felt that the
number of deliveries could also contribute to the total travel time. The Butler Trucking
data, with the number of deliveries added, are shown in Table 15.2. To develop the
estimated multiple regression equation with both miles traveled ($x_1$) and number of deliveries
($x_2$) as independent variables, we will use Excel’s Regression tool.


Using Excel's Regression Tool to Develop the Estimated Multiple Regression Equation


In Section 14.7 we showed how Excel’s Regression tool could be used to determine the 
estimated regression equation for Armand’s Pizza Parlors. We can use the same procedure 
with minor modifications to develop the estimated multiple regression equation for Butler 
Trucking. Refer to Figures 15.4 and 15.5 as we describe the tasks involved. 


Enter/Access Data: Open the file Butler. The data are in cells B2:D11 and labels are in
column A and cells B1:D1.


Apply Tools: The following steps describe how to use Excel’s Regression tool for multiple 
regression analysis. 


Step 1. Click the DATA tab on the Ribbon

Step 2. In the Analyze group, click Data Analysis

Step 3. Choose Regression from the list of Analysis Tools

Step 4. When the Regression dialog box appears (see Figure 15.4):
    Enter D1:D11 in the Input Y Range: box
    Enter B1:C11 in the Input X Range: box
    Select the check box for Labels
    Select the check box for Confidence Level
    Enter 99 in the Confidence Level: box
    Select Output Range: in the Output options area
    Enter A13 in the Output Range: box (to identify the upper left corner of the
    section of the worksheet where the output will appear)
    Click OK


FIGURE 15.4 Regression Tool Dialog Box for the Butler Trucking Example


In the Excel output shown in Figure 15.5 the label for the independent variable $x_1$ is Miles
(see cell A30), and the label for the independent variable $x_2$ is Deliveries (see cell A31).
The estimated regression equation is


$$\hat{y} = -.8687 + .0611x_1 + .9234x_2 \tag{15.6}$$


Note that using Excel’s Regression tool for multiple regression is almost the same as using 
it for simple linear regression. The major difference is that in the multiple regression case a 
larger range of cells has to be provided in order to identify the independent variables. 
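For readers who want to see roughly what a software package does behind the dialog box, the following Python sketch fits the same two-variable model by solving the least squares normal equations with Gaussian elimination. It is an illustration under stated assumptions, not a description of Excel's internal method: the helper names `solve` and `fit_least_squares` are ours, and the deliveries values should be checked against the Butler data file.

```python
# Butler Trucking data: miles traveled (x1), deliveries (x2), travel time (y).
miles      = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
hours      = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def solve(a, b):
    """Solve the square linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit_least_squares(columns, y):
    """Return [b0, b1, ..., bp] solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]  # leading 1 for b0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


b0, b1, b2 = fit_least_squares([miles, deliveries], hours)
print(f"y_hat = {b0:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

Rounded to four decimal places, the coefficients agree with the estimated regression equation for Butler Trucking with two independent variables. Adding a third independent variable would only require passing one more column to `fit_least_squares`.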

In the next section we will discuss the use of the coefficient of multiple determination in
measuring how good a fit is provided by this estimated regression equation. Before doing so,
let us examine more carefully the values of $b_1 = .0611$ and $b_2 = .9234$ in equation (15.6).


Note on Interpretation of Coefficients 


One observation can be made at this point about the relationship between the estimated
regression equation with only the miles traveled as an independent variable and the equation
that includes the number of deliveries as a second independent variable. The value of $b_1$ is not
the same in both cases. In simple linear regression, we interpret $b_1$ as an estimate of the change
in y for a one-unit change in the independent variable. In multiple regression analysis, this
interpretation must be modified somewhat. That is, in multiple regression analysis, we interpret
each regression coefficient as follows: $b_i$ represents an estimate of the change in y
corresponding to a one-unit change in $x_i$ when all other independent variables are held constant. In the
Butler Trucking example involving two independent variables, $b_1 = .0611$. Thus, .0611 hours
is an estimate of the expected increase in travel time corresponding to an increase of 1 mile
in the distance traveled when the number of deliveries is held constant. Similarly, because
$b_2 = .9234$, an estimate of the expected increase in travel time corresponding to an increase of
one delivery when the number of miles traveled is held constant is .9234 hours.
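The "held constant" interpretation can be checked numerically with the estimated regression equation itself. In the sketch below, the function name `predicted_hours` and the starting assignment of 100 miles with 2 deliveries are arbitrary choices made only for illustration.

```python
def predicted_hours(miles, deliveries):
    """Estimated regression equation (15.6) for Butler Trucking."""
    return -0.8687 + 0.0611 * miles + 0.9234 * deliveries


base = predicted_hours(100, 2)
one_more_mile = predicted_hours(101, 2)       # deliveries held constant
one_more_delivery = predicted_hours(100, 3)   # miles held constant

print(round(one_more_mile - base, 4))       # change in predicted time equals b1
print(round(one_more_delivery - base, 4))   # change in predicted time equals b2
```

Because the equation is linear, the same two differences result from any other starting assignment.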


EXERCISES 


Note to student: The exercises involving data in this and subsequent sections were designed 
to be solved using a computer software package. 


Methods 
1. The estimated regression equation for a model involving two independent variables
and 10 observations follows.


$$\hat{y} = 29.1270 + .5906x_1 + .4980x_2$$


a. Interpret $b_1$ and $b_2$ in this estimated regression equation.
b. Predict y when $x_1 = 180$ and $x_2 = 310$.
2. Consider the following data for a dependent variable y and two independent variables,
$x_1$ and $x_2$.

DATA file: Exer2

$x_1$    $x_2$    y
30       12        94
47       10       108
25       17       112
51       16       178
40        5        94
51       19       175
74        7       170
36       12       117
59       13       142
76       16       211

a. Develop an estimated regression equation relating y to $x_1$. Predict y if $x_1 = 45$.
b. Develop an estimated regression equation relating y to $x_2$. Predict y if $x_2 = 15$.
c. Develop an estimated regression equation relating y to $x_1$ and $x_2$. Predict y if
$x_1 = 45$ and $x_2 = 15$.
3. In a regression analysis involving 30 observations, the following estimated regression
equation was obtained.


$$\hat{y} = 17.6 + 3.8x_1 - 2.3x_2 + 7.6x_3 + 2.7x_4$$


a. Interpret $b_1$, $b_2$, $b_3$, and $b_4$ in this estimated regression equation.
b. Predict y when $x_1 = 10$, $x_2 = 5$, $x_3 = 1$, and $x_4 = 2$.


Applications 
4. Forecasting Shoe Sales. A shoe store developed the following estimated regression
equation relating sales to inventory investment and advertising expenditures.


$$\hat{y} = 25 + 10x_1 + 8x_2$$


where
$x_1$ = inventory investment ($1000s)
$x_2$ = advertising expenditures ($1000s)
y = sales ($1000s)


a. Predict the sales resulting from a $15,000 investment in inventory and an
advertising budget of $10,000.
b. Interpret $b_1$ and $b_2$ in this estimated regression equation.
5. Predicting Theater Revenue. The owner of Showtime Movie Theaters, Inc. would
like to predict weekly gross revenue as a function of advertising expenditures.
Historical data for a sample of eight weeks follow.

DATA file: Showtime

Weekly           Television     Newspaper
Gross Revenue    Advertising    Advertising
($1000s)         ($1000s)       ($1000s)
96               5.0            1.5
90               2.0            2.0
95               4.0            1.5
92               2.5            2.5
95               3.0            3.3
94               3.5            2.3
94               2.5            4.2
94               3.0            2.5


a. Develop an estimated regression equation with the amount of television advertising
as the independent variable.

b. Develop an estimated regression equation with both television advertising and
newspaper advertising as the independent variables.

c. Is the estimated regression equation coefficient for television advertising
expenditures the same in part (a) and in part (b)? Interpret the coefficient in each case.

d. Predict weekly gross revenue for a week when $3500 is spent on television
advertising and $2300 is spent on newspaper advertising.

6. Predicting NFL Wins. The National Football League (NFL) records a variety
of performance data for individuals and teams. To investigate the importance of
passing on the percentage of games won by a team, the following data show the
conference (Conf), average number of passing yards per attempt (Yds/Att), the
number of interceptions thrown per attempt (Int/Att), and the percentage of games
won (Win%) for a random sample of 16 NFL teams for one full season.

DATA file: PassingNFL

Team                    Conf    Yds/Att    Int/Att    Win%
Arizona Cardinals       NFC     6.5        .042       50.0
Atlanta Falcons         NFC     7.1        .022       62.5
Carolina Panthers       NFC     7.4        .033       37.5
Cincinnati Bengals      AFC     6.2        .026       56.3
Detroit Lions           NFC     7.2        .024       62.5
Green Bay Packers       NFC     8.9        .014       93.8
Houston Texans          AFC     7.5        .019       62.5
Indianapolis Colts      AFC     5.6        .026       12.5
Jacksonville Jaguars    AFC     4.6        .032       31.3
Minnesota Vikings       NFC     5.8        .033       18.8
New England Patriots    AFC     8.3        .020       81.3
New Orleans Saints      NFC     8.1        .021       81.3
Oakland Raiders         AFC     7.6        .044       50.0
San Francisco 49ers     NFC     6.5        .011       81.3
Tennessee Titans        AFC     6.7        .024       56.3
Washington Redskins     NFC     6.4        .041       31.3

a. Develop the estimated regression equation that could be used to predict the percent- 
age of games won given the average number of passing yards per attempt. 

b. Develop the estimated regression equation that could be used to predict the 
percentage of games won given the number of interceptions thrown per attempt. 

c. Develop the estimated regression equation that could be used to predict the percent- 
age of games won given the average number of passing yards per attempt and the 
number of interceptions thrown per attempt. 

d. The average number of passing yards per attempt for the Kansas City Chiefs was 
6.2 and the number of interceptions thrown per attempt was .036. Use the estimated 
regression equation developed in part (c) to predict the percentage of games won by 
the Kansas City Chiefs. (Note: For this season the Kansas City Chiefs’ record was 
7 wins and 9 losses.) Compare your prediction to the actual percentage of games 
won by the Kansas City Chiefs. 

7. Federal Employee Satisfaction. The United States Office of Personnel Management
(OPM) manages the civil service of the federal government. Results from its
annual Federal Employee Viewpoint Survey (FEVS) are used to measure employee
satisfaction on several work aspects, including Job Satisfaction, Pay Satisfaction,
Organization Satisfaction, and an overall measure of satisfaction referred to as
Global Satisfaction. In each case a 100-point scale with higher values indicating
greater satisfaction is used (OPM website). Scores for Global Satisfaction,
Job Satisfaction, Pay Satisfaction, and Organization Satisfaction for a sample of
65 employees are provided in the file Satisfaction.

a. Develop the estimated multiple regression equation that can be used to predict the
Global Satisfaction score using the Job Satisfaction, Pay Satisfaction, and
Organization Satisfaction scores.

b. Predict the overall Global Satisfaction score for an employee with a Job
Satisfaction score of 72, a Pay Satisfaction score of 54, and an Organization Satisfaction
score of 53.

8. Scoring Cruise Ships. The Condé Nast Traveler Gold List provides ratings for
the top 20 small cruise ships. The following data are the scores each ship
received based upon the results from Condé Nast Traveler’s annual Readers’
Choice Survey. Each score represents the percentage of respondents who rated a
ship as excellent or very good on several criteria, including Shore Excursions and
Food/Dining. An overall score is also reported and used to rank the ships. The
highest ranked ship, the Seabourn Odyssey, has an overall score of 94.4, the highest
component of which is 97.8 for Food/Dining.

DATA file: Ships

                                               Shore
Ship                            Overall        Excursions     Food/Dining
Seabourn Odyssey                94.4           90.9           97.8
Seabourn Pride                  93.0           84.2           96.7
National Geographic Endeavor    92.9           100.0          88.5
Seabourn Sojourn                [illegible]    94.8           [illegible]
Paul Gauguin                    90.5           87.9           [illegible]
Seabourn Legend                 90.3           82.1           98.8
Seabourn Spirit                 90.2           86.3           92.0
Silver Explorer                 89.9           92.6           88.9
Silver Spirit                   89.4           85.9           90.8
Seven Seas Navigator            89.2           83.3           90.5
Silver Whisperer                89.2           82.0           88.6
National Geographic Explorer    89.1           [illegible]    89.7
Silver Cloud                    88.7           78.3           [illegible]
Celebrity Xpedition             87.2           [illegible]    73.6
Silver Shadow                   87.2           75.0           89.7
Silver Wind                     86.6           [illegible]    91.6
SeaDream II                     86.2           77.4           90.9
Wind Star                       86.1           76.5           [illegible]
Wind Surf                       86.1           72.3           89.3
Wind Spirit                     85.2           77.4           [illegible]

a. Determine an estimated regression equation that can be used to predict the overall 
score given the score for Shore Excursions. 

b. Consider the addition of the independent variable Food/Dining. Develop the es- 
timated regression equation that can be used to predict the overall score given the 
scores for Shore Excursions and Food/Dining. 

c. Predict the overall score for a cruise ship with a Shore Excursions score of 80 and a 
Food/Dining Score of 90. 

9. House Prices. Spring is a peak time for selling houses. The file SpringHouses
contains the selling price, number of bathrooms, square footage, and number of bedrooms
of 26 homes sold in Ft. Thomas, Kentucky, in spring 2018 (realtor.com website).

a. Develop scatter plots of selling price versus number of bathrooms, selling price
versus square footage, and selling price versus number of bedrooms. Comment on the
relationship between selling price and these three variables.

b. Develop an estimated regression equation that can be used to predict the selling
price given the three independent variables (number of baths, square footage, and
number of bedrooms).

c. It is argued that we do not need both number of baths and number of bedrooms.
Develop an estimated regression equation that can be used to predict selling price
given square footage and the number of bedrooms.

d. Suppose your house has four bedrooms and is 2650 square feet. What is the
predicted selling price using the model developed in part (c)?

10. Predicting Baseball Pitcher Performance. Major League Baseball (MLB) consists
of teams that play in the American League and the National League. MLB collects a
wide variety of team and player statistics. Some of the statistics often used to evaluate
pitching performance are as follows:


ERA: The average number of earned runs given up by the pitcher per nine innings. 
An earned run is any run that the opponent scores off a particular pitcher except for 
runs scored as a result of errors. 


SO/IP: The average number of strikeouts per inning pitched. 
HR/IP: The average number of home runs per inning pitched. 
R/IP: The number of runs given up per inning pitched. 


The following data show values for these statistics for a random sample of 20 pitchers 
from the American League for a full season. 


DATA file: PitchingMLB

Player          Team    W              L              ERA            SO/IP          HR/IP          R/IP
Verlander, J    DET     24             5              2.40           1.00           [illegible]    .29
Beckett, J      BOS     13             7              2.89           .91            [illegible]    .34
Wilson, C       TEX     16             7              2.94           .92            .07            .40
Sabathia, C     NYY     19             8              3.00           .97            .07            .37
Haren, D        LAA     16             10             3.17           .81            .08            .38
McCarthy, B     OAK     9              9              3.32           .72            .06            .43
Santana, E      LAA     [illegible]    12             3.38           .78            [illegible]    .42
Lester, J       BOS     15             9              3.47           .95            .10            .40
Hernandez, F    SEA     [illegible]    [illegible]    [illegible]    .95            .08            .42
Buehrle, M      CWS     18             9              3.59           .58            .10            .45
Pineda, M       SEA     9              10             3.74           1.01           [illegible]    .44
Colon, B        NYY     8              10             4.00           .82            [illegible]    .52
Tomlin, J       CLE     12             [illegible]    4.25           .54            .15            .48
Pavano, C       MIN     9              13             4.30           .46            .10            .55
Danks, J        CWS     8              12             4.33           [illegible]    [illegible]    .52
Guthrie, J      BAL     9              17             4.33           .63            [illegible]    .54
Lewis, C        TEX     14             10             4.40           .84            [illegible]    [illegible]
Scherzer, M     DET     15             9              4.43           .89            .15            .52
Davis, W        TB      11             10             4.45           .57            [illegible]    .52
Porcello, R     DET     14             2              4.75           .57            .10            .57


a. Develop an estimated regression equation that can be used to predict the average 
number of runs given up per inning given the average number of strikeouts per 
inning pitched. 

b. Develop an estimated regression equation that can be used to predict the average 
number of runs given up per inning given the average number of home runs per 
inning pitched. 

c. Develop an estimated regression equation that can be used to predict the average 
number of runs given up per inning given the average number of strikeouts per 
inning pitched and the average number of home runs per inning pitched. 

d. A. J. Burnett, a pitcher for the New York Yankees, had an average number of 
strikeouts per inning pitched of .91 and an average number of home runs per inning 
of .16. Use the estimated regression equation developed in part (c) to predict the 
average number of runs given up per inning for A. J. Burnett. (Note: The actual 
value for R/IP was .6.) 

e. Suppose a suggestion was made to also use the earned run average as another
independent variable in part (c). What do you think of this suggestion?


15.3 Multiple Coefficient of Determination


In simple linear regression we showed that the total sum of squares can be partitioned into 
two components: the sum of squares due to regression and the sum of squares due to error. 
The same procedure applies to the sum of squares in multiple regression. 


RELATIONSHIP AMONG SST, SSR, AND SSE 
$$\text{SST} = \text{SSR} + \text{SSE} \tag{15.7}$$
where

SST = total sum of squares = $\sum (y_i - \bar{y})^2$
SSR = sum of squares due to regression = $\sum (\hat{y}_i - \bar{y})^2$
SSE = sum of squares due to error = $\sum (y_i - \hat{y}_i)^2$


Because of the computational difficulty in computing the three sums of squares, we rely
on computer packages to determine those values. The analysis of variance part of the Excel
output in Figure 15.5 shows the three values for the Butler Trucking problem with two
independent variables: SST = 23.9, SSR = 21.6006, and SSE = 2.2994. With only one
independent variable (number of miles traveled), the Excel output in Figure 15.3 shows
that SST = 23.9, SSR = 15.8713, and SSE = 8.0287. The value of SST is the same in
both cases because it does not depend on $\hat{y}$, but SSR increases and SSE decreases when a
second independent variable (number of deliveries) is added. The implication is that the
estimated multiple regression equation provides a better fit for the observed data.
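The partition of SST into SSR and SSE can be verified directly from the Butler Trucking data. The Python sketch below is one possible way to do so, not the procedure Excel uses: it fits the two-variable model with the closed-form solution of the normal equations for two independent variables and then evaluates each sum of squares from its definition. The helper names `mean` and `cross` are ours, and the deliveries values should be checked against the Butler data file.

```python
miles      = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
hours      = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


# Least squares coefficients for two independent variables
# (Cramer's rule applied to the normal equations in deviation form).
s11, s22 = cross(miles, miles), cross(deliveries, deliveries)
s12 = cross(miles, deliveries)
s1y, s2y = cross(miles, hours), cross(deliveries, hours)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(hours) - b1 * mean(miles) - b2 * mean(deliveries)

y_bar = mean(hours)
y_hat = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(miles, deliveries)]

sst = sum((y - y_bar) ** 2 for y in hours)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # due to regression
sse = sum((y - yh) ** 2 for y, yh in zip(hours, y_hat))  # due to error

print(round(sst, 4), round(ssr, 4), round(sse, 4))
```

After rounding, the three values agree with the analysis of variance figures quoted above, and SSR and SSE add up to SST.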

In Chapter 14 we used the coefficient of determination, $r^2$ = SSR/SST, to measure the
goodness of fit for the estimated regression equation. The same concept applies to
multiple regression. The term multiple coefficient of determination indicates that we are
measuring the goodness of fit for the estimated multiple regression equation. The multiple
coefficient of determination, denoted $R^2$, is computed as follows.


MULTIPLE COEFFICIENT OF DETERMINATION
$$R^2 = \frac{\text{SSR}}{\text{SST}} \tag{15.8}$$


In the Excel Regression tool output the label R Square is used to identify the value of $R^2$.


The multiple coefficient of determination can be interpreted as the proportion of the vari- 
ability in the dependent variable that can be explained by the estimated multiple regression 
equation. Hence, when multiplied by 100, it can be interpreted as the percentage of the 
variability in y that can be explained by the estimated regression equation. 

In the two-independent-variable Butler Trucking example, with SSR = 21.6006 and 
SST = 23.9, we have 


$$R^2 = \frac{21.6006}{23.9} = .9038$$

Therefore, 90.38% of the variability in travel time y is explained by the estimated
multiple regression equation with miles traveled and number of deliveries as the independent
variables. In Figure 15.5, we see that the multiple coefficient of determination is also
provided by the Excel output; it is denoted by R Square = .9038 (see cell B17).
Figure 15.3 shows that the R Square value for the estimated regression equation with only
one independent variable, number of miles traveled ($x_1$), is .6641. Thus, the percentage of the
variability in travel time that is explained by the estimated regression equation increases from
66.41% to 90.38% when number of deliveries is added as a second independent variable. In
general, $R^2$ always increases as independent variables are added to the model.

Adding independent variables causes the prediction errors (residuals) to become smaller,
thus reducing the sum of squares due to error, SSE. Because SSR = SST - SSE, when
SSE becomes smaller, SSR becomes larger, causing $R^2$ = SSR/SST to increase.

If a variable is added to the model, $R^2$ becomes larger even if the variable added
is not statistically significant. The adjusted multiple coefficient of determination
compensates for the number of independent variables in the model.

Many analysts prefer adjusting $R^2$ for the number of independent variables to avoid
overestimating the impact of adding an independent variable on the amount of variability
explained by the estimated regression equation. With n denoting the number of
observations and p denoting the number of independent variables, the adjusted multiple
coefficient of determination is computed as follows.


ADJUSTED MULTIPLE COEFFICIENT OF DETERMINATION 


$$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \tag{15.9}$$


For the Butler Trucking example with n = 10 and p = 2, we have 


$$R_a^2 = 1 - (1 - .9038)\frac{10 - 1}{10 - 2 - 1} = .8763$$


Thus, after adjusting for the two independent variables, we have an adjusted multiple 
coefficient of determination of .8763. This value is provided by the Excel output in 
Figure 15.5 as Adjusted R Square = .8763 (see cell B18). 
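Equations (15.8) and (15.9) are simple enough to evaluate in a few lines of code. The sketch below uses the sums of squares reported for the two Butler Trucking models; the function names are ours and are only meant to illustrate the arithmetic.

```python
def r_square(ssr, sst):
    """Multiple coefficient of determination, equation (15.8)."""
    return ssr / sst


def adjusted_r_square(r2, n, p):
    """Adjusted multiple coefficient of determination, equation (15.9)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Two independent variables (miles traveled and number of deliveries).
r2_two = r_square(21.6006, 23.9)
adj_two = adjusted_r_square(r2_two, n=10, p=2)

# One independent variable (miles traveled only).
r2_one = r_square(15.8713, 23.9)
adj_one = adjusted_r_square(r2_one, n=10, p=1)

print(round(r2_two, 4), round(adj_two, 4))
print(round(r2_one, 4), round(adj_one, 4))
```

For the two-variable model the results round to .9038 and .8763, the values shown in the Excel output. In both models the adjusted value is smaller than $R^2$, because the factor (n - 1)/(n - p - 1) exceeds 1 whenever p is at least 1.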


EXERCISES

Methods
11. In exercise 1, the following estimated regression equation based on 10 observations
was presented.


$$\hat{y} = 29.1270 + .5906x_1 + .4980x_2$$


The values of SST and SSR are 6724.125 and 6216.375, respectively.
a. Find SSE.
b. Compute $R^2$.
c. Compute $R_a^2$.
d. Comment on the goodness of fit.
12. In exercise 2, 10 observations were provided for a dependent variable y and two
independent variables $x_1$ and $x_2$; for these data SST = 15,182.9 and SSR = 14,052.2.
a. Compute $R^2$.
b. Compute $R_a^2$.
c. Does the estimated regression equation explain a large amount of the variability in
the data? Explain.
13. In exercise 3, the following estimated regression equation based on 30 observations
was presented.


$$\hat{y} = 17.6 + 3.8x_1 - 2.3x_2 + 7.6x_3 + 2.7x_4$$


The values of SST and SSR are 1805 and 1760, respectively.
a. Compute $R^2$.
b. Compute $R_a^2$.
c. Comment on the goodness of fit.


Applications 
14. $R^2$ in Shoe Sales Prediction. In exercise 4, the following estimated regression
equation relating sales to inventory investment and advertising expenditures was given.


$$\hat{y} = 25 + 10x_1 + 8x_2$$


The data used to develop the model came from a survey of 10 stores; for those data,
SST = 16,000 and SSR = 12,000.

a. For the estimated regression equation given, compute $R^2$.

b. Compute $R_a^2$.

c. Does the model appear to explain a large amount of variability in the data?
Explain.

DATA file: Showtime

15. $R^2$ in Theater Revenue Prediction. In exercise 5, the owner of Showtime Movie
Theaters, Inc. used multiple regression analysis to predict gross revenue (y) as a function
of television advertising ($x_1$) and newspaper advertising ($x_2$). The estimated regression
equation was


$$\hat{y} = 83.2 + 2.29x_1 + 1.30x_2$$


The computer solution provided SST = 25.5 and SSR = 23.435.

a. Compute and interpret $R^2$ and $R_a^2$.
b. When television advertising was the only independent variable, $R^2 = .653$ and
$R_a^2 = .595$. Do you prefer the multiple regression results? Explain.
DATA file: PassingNFL

16. Quality of Fit in Predicting NFL Wins. In exercise 6, data were given on the average
number of passing yards per attempt (Yds/Att), the number of interceptions thrown per
attempt (Int/Att), and the percentage of games won (Win%) for a random sample of
16 National Football League (NFL) teams for one full season.

a. Did the estimated regression equation that uses only the average number of passing 
yards per attempt as the independent variable to predict the percentage of games 
won provide a good fit? 

b. Discuss the benefit of using both the average number of passing yards per attempt 
and the number of interceptions thrown per attempt to predict the percentage of 
games won. 

DATA file: SpringHouses

17. Quality of Fit in Predicting House Prices. Revisit exercise 9, where we develop an
estimated regression equation that can be used to predict the selling price given the
number of bathrooms, square footage, and number of bedrooms in the house.

a. Does the estimated regression equation provide a good fit to the data? Explain. 

b. In part (c) of exercise 9 you developed an estimated regression equation that predicts 
selling price given the square footage and number of bedrooms. Compare the fit for 
this simpler model to that of the model that also includes number of bathrooms as an 
independent variable. 

DATA file: PitchingMLB

18. $R^2$ in Predicting Baseball Pitcher Performance. Refer to exercise 10, where Major
League Baseball (MLB) pitching statistics were reported for a random sample of
20 pitchers from the American League for one full season.

a. In part (c) of exercise 10, an estimated regression equation was developed relating
the average number of runs given up per inning pitched given the average number
of strikeouts per inning pitched and the average number of home runs per inning
pitched. What are the values of $R^2$ and $R_a^2$?

b. Does the estimated regression equation provide a good fit to the data? Explain. 

c. Suppose the earned run average (ERA) is used as the dependent variable in 
part (c) instead of the average number of runs given up per inning pitched. Does 
the estimated regression equation that uses the ERA provide a good fit to the 
data? Explain. 




15.4 Model Assumptions 


In Section 15.1 we introduced the following multiple regression model. 


MULTIPLE REGRESSION MODEL 
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{15.10}$$


The assumptions about the error term $\epsilon$ in the multiple regression model parallel those for
the simple linear regression model.


ASSUMPTIONS ABOUT THE ERROR TERM $\epsilon$ IN THE MULTIPLE REGRESSION
MODEL $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$
1. The error term $\epsilon$ is a random variable with mean or expected value of zero; that
is, $E(\epsilon) = 0$.
Implication: For given values of $x_1, x_2, \ldots, x_p$, the expected, or average, value
of y is given by


$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \tag{15.11}$$


Equation (15.11) is the multiple regression equation we introduced in
Section 15.1. In this equation, E(y) represents the average of all possible values
of y that might occur for the given values of $x_1, x_2, \ldots, x_p$.
2. The variance of $\epsilon$ is denoted by $\sigma^2$ and is the same for all values of the
independent variables $x_1, x_2, \ldots, x_p$.
Implication: The variance of y about the regression line equals $\sigma^2$ and is the
same for all values of $x_1, x_2, \ldots, x_p$.
3. The values of $\epsilon$ are independent.
Implication: The value of $\epsilon$ for a particular set of values for the independent
variables is not related to the value of $\epsilon$ for any other set of values.
4. The error term $\epsilon$ is a normally distributed random variable reflecting
the deviation between the y value and the expected value of y given by
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.
Implication: Because $\beta_0, \beta_1, \ldots, \beta_p$ are constants for the given values of
$x_1, x_2, \ldots, x_p$, the dependent variable y is also a normally distributed random variable.
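A small simulation can make these assumptions concrete. The Python sketch below generates many values of y at one fixed pair ($x_1$, $x_2$) from a model whose errors are independent, normally distributed, and have mean zero and constant standard deviation. The parameter values, the standard deviation of .5, the seed, and the function names are all hypothetical choices made only for illustration; in practice the true parameters are unknown.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, BETA2 = -0.87, 0.061, 0.92
SIGMA = 0.5   # standard deviation of the error term, the same for every (x1, x2)


def expected_y(x1, x2):
    """Multiple regression equation: the mean of y at the given x1 and x2."""
    return BETA0 + BETA1 * x1 + BETA2 * x2


def simulate_y(x1, x2, rng):
    """Multiple regression model: the mean of y plus a random error term."""
    epsilon = rng.gauss(0.0, SIGMA)   # normal, mean 0, independent draws
    return expected_y(x1, x2) + epsilon


rng = random.Random(42)   # fixed seed so the run is repeatable
draws = [simulate_y(100, 4, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

print(round(expected_y(100, 4), 3), round(sample_mean, 3))
```

With this many simulated values the sample mean lands very close to E(y) for the chosen $x_1$ and $x_2$, which is the implication of the first assumption.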


To obtain more insight about the form of the relationship given by equation (15.11),
consider the following two-independent-variable multiple regression equation.


$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$


The graph of this equation is a plane in three-dimensional space. Figure 15.6 provides
an example of such a graph. Note that the value of $\epsilon$ shown is the difference between the
actual y value and the expected value of y, E(y), when $x_1 = x_1^*$ and $x_2 = x_2^*$.


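FIGURE 15.6 Graph of the Regression Equation for Multiple Regression
Analysis with Two Independent Variables

[Figure: three-dimensional plot with axes $x_1$, $x_2$, and y showing the plane corresponding
to $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, the intercept $\beta_0$ on the y axis, the point corresponding to
$x_1 = x_1^*$ and $x_2 = x_2^*$, and the value of y when $x_1 = x_1^*$ and $x_2 = x_2^*$.]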
In regression analysis, the term response variable is often used in place of the term
dependent variable. Furthermore, since the multiple regression equation generates a plane
or surface, its graph is called a response surface.


15.5 Testing for Significance 


In this section we show how to conduct significance tests for a multiple regression
relationship. The significance tests we used in simple linear regression were a t test and an
F test. In simple linear regression, both tests provide the same conclusion; that is, if the
null hypothesis is rejected, we conclude that $\beta_1 \ne 0$. In multiple regression, the t test and
the F test have different purposes.


1. The F test is used to determine whether a significant relationship exists between the 
dependent variable and the set of all the independent variables; we will refer to the 
F test as the test for overall significance. 

2. If the F test shows an overall significance, the t test is used to determine whether
each of the individual independent variables is significant. A separate t test is
conducted for each of the independent variables in the model; we refer to each of these
t tests as a test for individual significance.


In the material that follows, we will explain the F test and the t test and apply each to the
Butler Trucking Company example.


F Test 


The multiple regression model as defined in Section 15.4 is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
The hypotheses for the F test involve the parameters of the multiple regression model.


$H_0$: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$: One or more of the parameters is not equal to zero


If $H_0$ is rejected, the test gives us sufficient statistical evidence to conclude that one or more
of the parameters is not equal to zero and that the overall relationship between y and the set
of independent variables $x_1, x_2, \ldots, x_p$ is significant. However, if $H_0$ cannot be rejected, we
do not have sufficient evidence to conclude that a significant relationship is present.

Before describing the steps of the F test, we need to review the concept of mean square.
A mean square is a sum of squares divided by its corresponding degrees of freedom. In the
multiple regression case, the total sum of squares has $n - 1$ degrees of freedom, the sum of
squares due to regression (SSR) has p degrees of freedom, and the sum of squares due to
error has $n - p - 1$ degrees of freedom. Hence, the mean square due to regression (MSR)
is SSR/p and the mean square due to error (MSE) is SSE/($n - p - 1$).


$\text{MSR} = \dfrac{\text{SSR}}{p}$ (15.12)

and

$\text{MSE} = \dfrac{\text{SSE}}{n - p - 1}$ (15.13)
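
As a rough illustration of equations (15.12) and (15.13), the Python sketch below computes the mean squares and their ratio from the sums of squares. The function name and the numbers are hypothetical, chosen only so the arithmetic is easy to follow; the sketch relies on the relationship SST = SSR + SSE.

```python
def mean_squares(sst, ssr, n, p):
    """Return (SSE, MSR, MSE, MSR/MSE) for a regression with p independent variables."""
    sse = sst - ssr              # SST = SSR + SSE
    msr = ssr / p                # equation (15.12)
    mse = sse / (n - p - 1)      # equation (15.13)
    return sse, msr, mse, msr / mse


# Hypothetical sums of squares (not data from the text).
sse, msr, mse, ratio = mean_squares(sst=100.0, ssr=80.0, n=13, p=2)
print(sse, msr, mse, ratio)      # 20.0 40.0 2.0 20.0
```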


As discussed in Chapter 14, MSE provides an unbiased estimate of $\sigma^2$, the variance of
the error term $\epsilon$. If $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$ is true, MSR also provides an unbiased
estimate of $\sigma^2$, and the value of MSR/MSE should be close to 1. However, if $H_0$ is
false, MSR overestimates $\sigma^2$ and the value of MSR/MSE becomes larger. To determine
how large the value of MSR/MSE must be to reject $H_0$, we make use of the fact that
if $H_0$ is true and the assumptions about the multiple regression model are valid, the
sampling distribution of MSR/MSE is an F distribution with p degrees of freedom in the
numerator and $n - p - 1$ in the denominator. A summary of the F test for significance
in multiple regression follows.


F TEST FOR OVERALL SIGNIFICANCE

$H_0$: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$: One or more of the parameters is not equal to zero


TEST STATISTIC

$F = \dfrac{\text{MSR}}{\text{MSE}}$ (15.14)

REJECTION RULE

p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $F \ge F_\alpha$


where $F_\alpha$ is based on an F distribution with p degrees of freedom in the numerator and
$n - p - 1$ degrees of freedom in the denominator.


Let us apply the F test to the Butler Trucking Company multiple regression problem.
With two independent variables, the hypotheses are written as follows.


$H_0$: $\beta_1 = \beta_2 = 0$
$H_a$: $\beta_1$ and/or $\beta_2$ is not equal to zero


Figure 15.7 shows a portion of the Excel Regression tool output shown previously
in Figure 15.5 with miles traveled ($x_1$) and number of deliveries ($x_2$) as the two independent
variables. In the analysis of variance part of the output, we see that MSR = 10.8003 and
MSE = .3285. Using equation (15.14), we obtain the test statistic.


$F = \dfrac{10.8003}{.3285} = 32.9$


FIGURE 15.7 Regression Tool Output for the Butler Trucking Example with Two Independent Variables

[Figure annotations: The Significance F value in cell F24 is the p-value used to test for overall
significance. The label Significance F in cell F23 is used to identify the p-value in cell F24.
The p-value in cell E30 is used to test for individual significance.]




Note that the F value in the Excel output is F = 32.8784; the value we calculated differs
because we used rounded values for MSR and MSE in the calculation. Using $\alpha = .01$, the
p-value = 0.0003 in cell F24 indicates that we can reject $H_0$: $\beta_1 = \beta_2 = 0$ because the p-value
is less than $\alpha = .01$. Alternatively, Table 4 of Appendix B shows that with 2 degrees of freedom
in the numerator and 7 degrees of freedom in the denominator, $F_{.01} = 9.55$. With 32.9 > 9.55,
we reject $H_0$: $\beta_1 = \beta_2 = 0$ and conclude that a significant relationship is present between travel
time y and the two independent variables, miles traveled and number of deliveries.
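
The Butler Trucking F test can be retraced in a few lines of Python. This is only a sketch: the function name is hypothetical, and the p-value uses a closed form that holds only when the numerator has exactly 2 degrees of freedom (as it does here, with two independent variables). It is not how Excel computes Significance F in general.

```python
def f_upper_tail_2_num_df(f_value, denom_df):
    """P(F >= f_value) for an F distribution with 2 numerator degrees of freedom.

    With exactly 2 numerator degrees of freedom, the upper-tail area has the
    closed form (1 + 2f/d)^(-d/2), where d is the denominator degrees of freedom.
    """
    return (1.0 + 2.0 * f_value / denom_df) ** (-denom_df / 2.0)


n, p = 10, 2
msr, mse = 10.8003, 0.3285   # from the analysis of variance part of the output
alpha = 0.01
f_critical = 9.55            # F_.01 with 2 and 7 degrees of freedom (Table 4, Appendix B)

f_stat = msr / mse                                   # equation (15.14)
p_value = f_upper_tail_2_num_df(f_stat, n - p - 1)

print(round(f_stat, 1))      # 32.9
print(round(p_value, 4))     # 0.0003
print(f_stat >= f_critical)  # True: reject H0 by the critical value approach
print(p_value <= alpha)      # True: reject H0 by the p-value approach
```

Both rejection rules lead to the same conclusion, as they must, because they are two ways of expressing the same comparison.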

As noted previously, MSE provides an unbiased estimate of $\sigma^2$, the variance of the error
term $\epsilon$. Thus, the estimate of $\sigma^2$ is MSE = .3285. The square root of MSE is the estimate
of the standard deviation of the error term. As defined in Section 14.5, this standard
deviation is called the standard error of the estimate and is denoted s. Hence, we have
$s = \sqrt{\text{MSE}} = \sqrt{.3285} = .5731$. Note that the value of the standard error of the estimate
appears in cell B19 of Figure 15.7.

Table 15.3 is the general form of the ANOVA table for multiple regression. The value of 
the F test statistic and its corresponding p-value in the last column can be used to make the 
hypothesis test conclusion. By reviewing the Excel output for Butler Trucking Company in 
Figure 15.7, we see that Excel’s analysis of variance table contains this information. 


t Test 


If the F test shows that the multiple regression relationship is significant, a t test can be
conducted to determine the significance of each of the individual parameters. The t test for
individual significance follows.


t TEST FOR INDIVIDUAL SIGNIFICANCE


For any parameter $\beta_i$

$H_0$: $\beta_i = 0$
$H_a$: $\beta_i \ne 0$


TEST STATISTIC

$t = \dfrac{b_i}{s_{b_i}}$ (15.15)

REJECTION RULE

p-value approach: Reject $H_0$ if p-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or if $t \ge t_{\alpha/2}$


where $t_{\alpha/2}$ is based on a t distribution with $n - p - 1$ degrees of freedom.


TABLE 15.3 General Form of the ANOVA Table for Multiple Regression
with p Independent Variables

Source       Sum of Squares   Degrees of Freedom   Mean Square               F             p-Value
Regression   SSR              p                    MSR = SSR/p               F = MSR/MSE
Error        SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total        SST              n - 1
In the test statistic, $s_{b_i}$ is the estimate of the standard deviation of $b_i$. The value of $s_{b_i}$ will be
provided by the computer software package.

Let us conduct the t test for the Butler Trucking regression problem. Refer to the section
of Figure 15.7 that shows the Excel output for the t-ratio calculations. Values of $b_1$, $b_2$, $s_{b_1}$,
and $s_{b_2}$ are as follows.


$b_1 = .0611$    $s_{b_1} = .0099$
$b_2 = .9234$    $s_{b_2} = .2211$


Using equation (15.15), we obtain the test statistic for the hypotheses involving parameters
$\beta_1$ and $\beta_2$.

t = .0611/.0099 = 6.1717

t = .9234/.2211 = 4.1764

Note that both of these t-ratio values and the corresponding p-values are provided by the
Excel Regression tool output in Figure 15.7. Using $\alpha = .01$, the p-values of .0005 and .0042
on the Excel output indicate that we can reject $H_0$: $\beta_1 = 0$ and $H_0$: $\beta_2 = 0$. Hence, both
parameters are statistically significant. Alternatively, Table 2 of Appendix B shows that with
$n - p - 1 = 10 - 2 - 1 = 7$ degrees of freedom, $t_{.005} = 3.499$. Because 6.1717 > 3.499,
we reject $H_0$: $\beta_1 = 0$. Similarly, with 4.1763 > 3.499, we reject $H_0$: $\beta_2 = 0$.

The t values in the Regression tool output are 6.1824 and 4.1763. The difference is due to
rounding.
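
The two t-ratio calculations can be checked with a short Python sketch. The function and variable names are illustrative; the estimates, standard errors, and critical value are the ones quoted in the text.

```python
def t_statistic(b, s_b):
    """Test statistic t = b_i / s_{b_i} from equation (15.15)."""
    return b / s_b


t_critical = 3.499   # t_.005 with 7 degrees of freedom (Table 2, Appendix B)

# (estimate b_i, estimated standard deviation s_{b_i}) from the Excel output
estimates = {
    "miles traveled (x1)": (0.0611, 0.0099),
    "number of deliveries (x2)": (0.9234, 0.2211),
}

for name, (b, s_b) in estimates.items():
    t = t_statistic(b, s_b)
    reject = abs(t) >= t_critical   # two-tailed critical value rule
    print(f"{name}: t = {t:.4f}, reject H0: {reject}")
# miles traveled (x1): t = 6.1717, reject H0: True
# number of deliveries (x2): t = 4.1764, reject H0: True
```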


Multicollinearity 


We use the term independent variable in regression analysis to refer to any variable 
being used to predict or explain the value of the dependent variable. The term does not 
mean, however, that the independent variables themselves are independent in any statist- 
ical sense. On the contrary, most independent variables in a multiple regression problem 
are correlated to some degree with one another. For example, in the Butler Trucking
example involving the two independent variables $x_1$ (miles traveled) and $x_2$ (number of
deliveries), we could treat the miles traveled as the dependent variable and the number
of deliveries as the independent variable to determine whether those two variables are
themselves related. We could then compute the sample correlation coefficient $r_{x_1 x_2}$ to
determine the extent to which the variables are related. Doing so yields $r_{x_1 x_2} = .16$. Thus,
we find some degree of linear association between the two independent variables. In
multiple regression analysis, multicollinearity refers to the correlation among the inde- 
pendent variables. 

To provide a better perspective of the potential problems of multicollinearity, let us
consider a modification of the Butler Trucking example. Instead of $x_2$ being the number of
deliveries, let $x_2$ denote the number of gallons of gasoline consumed. Clearly, $x_1$ (the miles
traveled) and $x_2$ are related; that is, we know that the number of gallons of gasoline used
depends on the number of miles traveled. Hence, we would conclude logically that $x_1$ and
$x_2$ are highly correlated independent variables.

Assume that we obtain the equation $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$ and find that the F test shows
the relationship to be significant. Then suppose we conduct a t test on $\beta_1$ to determine
whether $\beta_1 \ne 0$, and we cannot reject $H_0$: $\beta_1 = 0$. Does this result mean that travel time is
not related to miles traveled? Not necessarily. What it probably means is that with $x_2$ already
in the model, $x_1$ does not make a significant contribution to determining the value of y. This
interpretation makes sense in our example; if we know the amount of gasoline consumed,
we do not gain much additional information useful in predicting y by knowing the miles
traveled. Similarly, a t test might lead us to conclude $\beta_2 = 0$ on the grounds that, with $x_1$ in
the model, knowledge of the amount of gasoline consumed does not add much. In addition, if
multicollinearity is present the signs and magnitudes of the estimated slope coefficients can
be misleading.

A sample correlation coefficient greater than +.7 or less than -.7 for two independent
variables is a rule of thumb warning of potential problems with multicollinearity.

To summarize, in t tests for the significance of individual parameters, the difficulty
caused by multicollinearity is that it is possible to conclude that none of the individual
parameters are significantly different from zero when an F test on the overall multiple
regression equation indicates a significant relationship. Furthermore, multicollinearity can
cause the estimated slope coefficients to be misleading. These problems are avoided when
there is little correlation among the independent variables.

Statisticians have developed several tests for determining whether multicollinearity is 
high enough to cause problems. According to the rule of thumb test, multicollinearity is 
a potential problem if the absolute value of the sample correlation coefficient exceeds .7 
for any two of the independent variables. The other types of tests are more advanced and 
beyond the scope of this text. 
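
The rule of thumb test is easy to sketch in Python. The function names and the small data set below are made up for illustration (they are not the Butler Trucking data); only the .7 threshold comes from the text.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient between two equally long lists of values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flags_multicollinearity(x, y, threshold=0.7):
    """Rule of thumb: warn when the absolute sample correlation exceeds .7."""
    return abs(sample_correlation(x, y)) > threshold


# Made-up values for two independent variables.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r = sample_correlation(x1, x2)
print(round(r, 4))                      # 0.7746
print(flags_multicollinearity(x1, x2))  # True
```

With more than two independent variables, the same check would be applied to every pair.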

If possible, every attempt should be made to avoid including independent variables that 
are highly correlated. In practice, however, strict adherence to this policy is rarely possi- 
ble. When decision makers have reason to believe substantial multicollinearity is present, 
they must realize that separating the effects of the individual independent variables on the 
dependent variable is difficult. 


NOTES + COMMENTS


Ordinarily, multicollinearity does not affect the way in which
we perform our regression analysis or interpret the output
from a study. However, when multicollinearity is severe (that
is, when two or more of the independent variables are highly
correlated with one another) we can have difficulty
interpreting the results of t tests on the individual parameters. In
addition to the type of problem illustrated in this section,
severe cases of multicollinearity have been shown to result in
least squares estimates that have the wrong sign. That is, in
simulated studies where researchers created the underlying
regression model and then applied the least squares
technique to develop estimates of $\beta_0$, $\beta_1$, $\beta_2$, and so on, it has
been shown that under conditions of high multicollinearity
the least squares estimates can have a sign opposite that
of the parameter being estimated. For example, a parameter $\beta_i$ might
actually be +10 and $b_i$, its estimate, might turn out to be -2.
Thus, little faith can be placed in the individual coefficients if
multicollinearity is present to a high degree.

When the independent variables are highly correlated, it is not possible
to determine the separate effect of any particular independent variable on
the dependent variable.


EXERCISES 


Methods 
19. In exercise 1, the following estimated regression equation based on 10 observations
was presented.


$\hat{y} = 29.1270 + .5906x_1 + .4980x_2$


Here SST = 6724.125, SSR = 6216.375, $s_{b_1} = .0813$, and $s_{b_2} = .0567$.
a. Compute MSR and MSE.
b. Compute F and perform the appropriate F test. Use $\alpha = .05$.
c. Perform a t test for the significance of $\beta_1$. Use $\alpha = .05$.
d. Perform a t test for the significance of $\beta_2$. Use $\alpha = .05$.
20. Refer to the data presented in exercise 2. The estimated regression equation for these
data is


$\hat{y} = -18.4 + 2.01x_1 + 4.74x_2$


Here SST = 15,182.9, SSR = 14,052.2, $s_{b_1} = .2471$, and $s_{b_2} = .9484$.
a. Test for a significant relationship among $x_1$, $x_2$, and y. Use $\alpha = .05$.
b. Is $\beta_1$ significant? Use $\alpha = .05$.
c. Is $\beta_2$ significant? Use $\alpha = .05$.
21. The following estimated regression equation was developed for a model involving two
independent variables.


$\hat{y} = 40.7 + 8.63x_1 + 2.71x_2$


After $x_2$ was dropped from the model, the least squares method was used to obtain an
estimated regression equation involving only $x_1$ as an independent variable.


$\hat{y} = 42.0 + 9.01x_1$


a. Give an interpretation of the coefficient of $x_1$ in both models.
b. Could multicollinearity explain why the coefficient of $x_1$ differs in the two models?
If so, how?


Applications 

22. Testing Significance in Shoe Sales Prediction. In exercise 4, the following estimated
regression equation relating sales to inventory investment and advertising expenditures
was given.


$\hat{y} = 25 + 10x_1 + 8x_2$


The data used to develop the model came from a survey of 10 stores; for these data
SST = 16,000 and SSR = 12,000.
a. Compute SSE, MSE, and MSR.
b. Use an F test and a .05 level of significance to determine whether there is a
relationship among the variables.
23. Testing Significance in Theater Revenue. Refer to exercise 5.
a. Use $\alpha = .01$ to test the hypotheses


$H_0$: $\beta_1 = \beta_2 = 0$
$H_a$: $\beta_1$ and/or $\beta_2$ is not equal to zero


for the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, where


$x_1$ = television advertising (\$1000s)
$x_2$ = newspaper advertising (\$1000s)


b. Use $\alpha = .05$ to test the significance of $\beta_1$. Should $x_1$ be dropped from the model?
c. Use $\alpha = .05$ to test the significance of $\beta_2$. Should $x_2$ be dropped from the model?

24. Testing Significance in Predicting NFL Wins. The National Football League (NFL) 
records a variety of performance data for individuals and teams. A portion of the data 
showing the average number of passing yards obtained per game on offense (Off- 
Pass Yds/G), the average number of yards given up per game on defense (Def Yds/G), 
and the percentage of games won (Win%) for one full season follows. 


DATA file: NFL2011


Team OffPassYds/G DefYds/G Win% 
Arizona 222E) 3551 50.0 
Atlanta 262.0 333.6 62.5 
Baltimore ZN BY 288.9 75.0 
St. Louis 1794 3584 12.5 
Tampa Bay 228.1 394.4 25.0 
Tennessee 245.2 S5511 563 
Washington 235.8 339.8 Silks 


a. Develop an estimated regression equation that can be used to predict the percentage 
of games won given the average number of passing yards obtained per game on 
offense and the average number of yards given up per game on defense. 
b. Use the F test to determine the overall significance of the relationship. What is your
conclusion at the .05 level of significance?

c. Use the t test to determine the significance of each independent variable. What is
your conclusion at the .05 level of significance?

DATA file: AutoResale

25. Auto Resale Value. The Honda Accord was named the best midsized car for
resale value for 2018 by the Kelley Blue Book (Kelley Blue Book website). The file
AutoResale contains mileage, age, and selling price for a sample of 33 Honda Accords.
a. Develop an estimated regression equation that predicts the selling price of a
used Honda Accord given the mileage and age of the car.

b. Is multicollinearity an issue for this model? Find the correlation between the
independent variables to answer this question.

c. Use the F test to determine the overall significance of the relationship. What is
your conclusion at the .05 level of significance?

d. Use the t test to determine the significance of each independent variable. What
is your conclusion at the .05 level of significance?

DATA file: PitchingMLB

26. Testing Significance in Baseball Pitcher Performance. In exercise 10, data showing
the values of several pitching statistics for a random sample of 20 pitchers from the
American League of Major League Baseball were provided. In part (c) of this exercise
an estimated regression equation was developed to predict the average number of runs
given up per inning pitched (R/IP) given the average number of strikeouts per inning
pitched (SO/IP) and the average number of home runs per inning pitched (HR/IP).

a. Use the F test to determine the overall significance of the relationship. What is your
conclusion at the .05 level of significance?

b. Use the t test to determine the significance of each independent variable. What is
your conclusion at the .05 level of significance?


15.6 Using the Estimated Regression Equation 
for Estimation and Prediction 


The procedures for estimating the mean value of y and predicting an individual value of y
in multiple regression are similar to those in regression analysis involving one independent
variable. First, recall that in Chapter 14 we showed that the estimated regression equation
$\hat{y} = b_0 + b_1 x$ can be used to estimate the mean value of y for a given value of x as well as
to predict an individual value of y for a given value of x. In multiple regression we use the
same procedure. That is, we substitute the values of $x_1, x_2, \ldots, x_p$ into the estimated
regression equation and use the corresponding value of $\hat{y}$ to estimate the mean value of y given
$x_1, x_2, \ldots, x_p$ as well as to predict an individual value of y given $x_1, x_2, \ldots, x_p$.

To illustrate the procedure in multiple regression, suppose that for the Butler Trucking
example we want to use the estimated regression equation involving $x_1$ (miles traveled)
and $x_2$ (number of deliveries) to develop two interval estimates:


1. A confidence interval of the mean travel time for all trucks that travel 100 miles 
and make two deliveries 

2. A prediction interval of the travel time for one specific truck that travels 100 miles 
and makes two deliveries 


The Excel output in Figure 15.5 showed the estimated regression equation is
$\hat{y} = -.8687 + .0611x_1 + .9234x_2$
With $x_1 = 100$ and $x_2 = 2$, we obtain the following value of $\hat{y}$:


$\hat{y} = -.8687 + .0611(100) + .9234(2) = 7.09$


Hence a point estimate of the mean travel time for all trucks that travel 100 miles and make 
two deliveries is approximately 7 hours. And a prediction of the travel time for one specific 
truck that travels 100 miles and makes two deliveries is also 7 hours. 
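
The substitution step can be written as a small Python sketch. The function name is an illustrative assumption; the coefficients are those of the estimated regression equation quoted above.

```python
def predict(coefficients, xs):
    """Evaluate y-hat = b0 + b1*x1 + ... + bp*xp for the given x values."""
    b0, *slopes = coefficients
    return b0 + sum(b * x for b, x in zip(slopes, xs))


butler = [-0.8687, 0.0611, 0.9234]   # b0, b1 (miles traveled), b2 (deliveries)

y_hat = predict(butler, [100, 2])    # 100 miles, two deliveries
print(round(y_hat, 2))               # 7.09
```

The same value serves both as the point estimate of the mean travel time and as the prediction for one specific truck; only the interval estimates around it differ.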
TABLE 15.4 The 95% Prediction Intervals for Butler Trucking

                                    Prediction Interval
Value of $x_1$   Value of $x_2$   Lower Limit   Upper Limit
50               2                2.414         5.656
50               3                3.368         6.548
50               4                4.157         7.607
100              2                5.500         8.683
100              3                6.520         9.510
100              4                7.362         10.515


The formulas required to develop confidence and prediction intervals for multiple
regression are beyond the scope of the text, and hand computation in multiple regression
is simply not practical. Although Excel’s Regression tool does not have an option for
computing interval estimates, some software packages do provide confidence and prediction
intervals. The interpretation of these intervals is the same as for simple linear regression.
Table 15.4 shows the 95% prediction intervals for the Butler Trucking problem for selected
values of $x_1$ and $x_2$. We see that the prediction interval of the travel time for one specific
truck that travels 100 miles and makes two deliveries is approximately 5.5 to 8.7 hours.


EXERCISES

Methods 

27. In exercise 1, the following estimated regression equation based on 10 observations
was presented.


$\hat{y} = 29.1270 + .5906x_1 + .4980x_2$


a. Develop a point estimate of the mean value of y when $x_1 = 180$ and $x_2 = 310$.
b. Predict an individual value of y when $x_1 = 180$ and $x_2 = 310$.

DATA file: Exer2

28. Refer to the data in exercise 2. The estimated regression equation for those data is


$\hat{y} = -18.4 + 2.01x_1 + 4.74x_2$


a. Develop a point estimate of the mean value of y when $x_1 = 45$ and $x_2 = 15$.
b. Develop a 95% prediction interval for y when $x_1 = 45$ and $x_2 = 15$.


Applications 

29. Confidence and Prediction Intervals for Theater Revenue. In exercise 5, the owner
of Showtime Movie Theaters, Inc. used multiple regression analysis to predict gross
revenue (y) as a function of television advertising ($x_1$) and newspaper advertising ($x_2$).
The estimated regression equation was


$\hat{y} = 83.23 + 2.29x_1 + 1.30x_2$


a. What is the gross revenue expected for a week when \$3500 is spent on television
advertising ($x_1 = 3.5$) and \$1800 is spent on newspaper advertising ($x_2 = 1.8$)?

b. Provide a 95% prediction interval for next week’s revenue, assuming that the
advertising expenditures will be allocated as in part (a).

DATA file: NFL2011

30. Confidence and Prediction Intervals for NFL Wins. In exercise 24, an estimated
regression equation was developed relating the percentage of games won by a team
in the National Football League during a complete season to the average number of
passing yards obtained per game on offense and the average number of yards given up
per game on defense during the season.
a. Predict the percentage of games won for a particular team that averages 225 passing
yards per game on offense and gives up an average of 300 yards per game on defense.
b. Develop a 95% prediction interval for the percentage of games won for a particular
team that averages 225 passing yards per game on offense and gives up an average
of 300 yards per game on defense.

DATA file: AutoResale

31. Confidence and Prediction Intervals for Auto Resale Value. Refer to exercise 25.
Use the estimated regression equation from part (a) to answer the following questions.
a. Estimate the selling price of a four-year-old Honda Accord with mileage of
40,000 miles.
b. Develop a 95% confidence interval for the selling price of a car with the data in
part (a).
c. Develop a 95% prediction interval for the selling price of a particular car having
the data in part (a).


15.7 Categorical Independent Variables 


Thus far, the examples we have considered involved quantitative independent variables
such as student population, distance traveled, and number of deliveries. In many situations,
however, we must work with categorical independent variables such as gender (male,
female) and method of payment (cash, credit card, check). The purpose of this section is to
show how categorical variables are handled in regression analysis. To illustrate the use and
interpretation of a categorical independent variable, we will consider a problem facing the
managers of Johnson Filtration, Inc.

The independent variables may be categorical or quantitative.


An Example: Johnson Filtration, Inc. 


Johnson Filtration, Inc., provides maintenance service for water-filtration systems 
throughout southern Florida. Customers contact Johnson with requests for maintenance 
service on their water-filtration systems. To estimate the service time and the service 
cost, Johnson’s managers want to predict the repair time necessary for each maintenance 
request. Hence, repair time in hours is the dependent variable. Repair time is believed to 
be related to two factors, the number of months since the last maintenance service and the 
type of repair problem (mechanical or electrical). Data for a sample of 10 service calls are 
reported in Table 15.5. 

Let y denote the repair time in hours and $x_1$ denote the number of months since the last
maintenance service. The regression model that uses only $x_1$ to predict y is


$y = \beta_0 + \beta_1 x_1 + \epsilon$


TABLE 15.5 Data for the Johnson Filtration Example

Service   Months Since                     Repair Time
Call      Last Service    Type of Repair   in Hours
1         2               Electrical       2.9
2         6               Mechanical       3.0
3         8               Electrical       4.8
4         3               Mechanical       1.8
5         2               Electrical       2.9
6         7               Electrical       4.9
7         9               Mechanical       4.2
8         8               Mechanical       4.8
9         4               Electrical       4.4
10        6               Electrical       4.5
FIGURE 15.8 Regression Tool Output for the Johnson Filtration Example with
Months Since Last Service Call as the Independent Variable

[Figure annotation: The Excel Regression tool output appears in a new worksheet because we
selected New Worksheet Ply as the Output option in the Regression dialog box.]


Using Excel’s Regression tool to develop the estimated regression equation, we obtained
the Excel output shown in Figure 15.8. The estimated regression equation is


$\hat{y} = 2.1473 + .3041x_1$ (15.16)


At the .05 level of significance, the p-value of .0163 for the t (or F) test indicates that the
number of months since the last service is significantly related to repair time. R Square =
.5342 indicates that $x_1$ alone explains 53.42% of the variability in repair time.

To incorporate the type of repair into the regression model, we define the following variable.


$x_2 = \begin{cases} 0 & \text{if the type of repair is mechanical} \\ 1 & \text{if the type of repair is electrical} \end{cases}$


In regression analysis $x_2$ is called a dummy or indicator variable. Using this dummy
variable, we can write the multiple regression model as


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$


Table 15.6 is the revised data set that includes the values of the dummy variable. Using
Excel and the data in Table 15.6, we can develop estimates of the model parameters. The
Excel Regression tool output in Figure 15.9 shows that the estimated multiple regression
equation is


$\hat{y} = .9305 + .3876x_1 + 1.2627x_2$ (15.17)


At the .05 level of significance, the p-value of .0010 associated with the F test (F = 21.357)
indicates that the regression relationship is significant. The t test part of the printout in
Figure 15.9 shows that both months since last service (p-value = .0004) and type of
repair (p-value = .0051) are statistically significant. In addition, R Square = 0.8592
and Adjusted R Square = 0.8190 indicate that the estimated regression equation does
a good job of explaining the variability in repair times. Thus, equation (15.17) should
prove helpful in predicting the repair time necessary for the various service calls.


TABLE 15.6 Data for the Johnson Filtration Example with Type of Repair Indicated
by a Dummy Variable ($x_2 = 0$ for Mechanical; $x_2 = 1$ for Electrical)

           Months Since           Type of          Repair Time
Customer   Last Service ($x_1$)   Repair ($x_2$)   in Hours (y)
1          2                      1                2.9
2          6                      0                3.0
3          8                      1                4.8
4          3                      0                1.8
5          2                      1                2.9
6          7                      1                4.9
7          9                      0                4.2
8          8                      0                4.8
9          4                      1                4.4
10         6                      1                4.5

DATA file: Johnson


FIGURE 15.9 Regression Tool Output for the Johnson Filtration Example with
Months Since Last Service Call and Type of Repair as the
Independent Variables
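
The estimates in equation (15.17) can be reproduced outside Excel. The Python sketch below codes the type of repair as a dummy variable and solves the least squares normal equations for two independent variables directly. The function and variable names are illustrative assumptions, and the direct solution shown is just one way to obtain the estimates; it is not a description of how the Excel Regression tool works internally.

```python
def fit_two_variable_regression(x1, x2, y):
    """Least squares estimates (b0, b1, b2) for y-hat = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross products about the means
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the two normal equations for the slopes, then recover the intercept
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Johnson Filtration data (months since last service, type of repair, repair time)
months = [2, 6, 8, 3, 2, 7, 9, 8, 4, 6]
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = [2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5]

# Dummy variable: 0 for a mechanical repair, 1 for an electrical repair
dummy = [1 if kind == "electrical" else 0 for kind in repair_type]

b0, b1, b2 = fit_two_variable_regression(months, dummy, hours)
print(round(b0, 4), round(b1, 4), round(b2, 4))   # 0.9305 0.3876 1.2627
```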


Interpreting the Parameters 


The multiple regression equation for the Johnson Filtration example is


$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ (15.18)

To understand how to interpret the parameters $\beta_0$, $\beta_1$, and $\beta_2$ when a categorical variable
is present, consider the case when $x_2 = 0$ (mechanical repair). Using E(y | mechanical) to
denote the mean or expected value of repair time given a mechanical repair, we have


$E(y \mid \text{mechanical}) = \beta_0 + \beta_1 x_1 + \beta_2(0) = \beta_0 + \beta_1 x_1$ (15.19)

Similarly, for an electrical repair ($x_2 = 1$), we have


$E(y \mid \text{electrical}) = \beta_0 + \beta_1 x_1 + \beta_2(1) = \beta_0 + \beta_1 x_1 + \beta_2 = (\beta_0 + \beta_2) + \beta_1 x_1$ (15.20)


Comparing equations (15.19) and (15.20), we see that the mean repair time is a linear
function of $x_1$ for both mechanical and electrical repairs. The slope of both equations is $\beta_1$,
but the y-intercept differs. The y-intercept is $\beta_0$ in equation (15.19) for mechanical repairs
and ($\beta_0 + \beta_2$) in equation (15.20) for electrical repairs. The interpretation of $\beta_2$ is that it
indicates the difference between the mean repair time for an electrical repair and the mean
repair time for a mechanical repair.

If $\beta_2$ is positive, the mean repair time for an electrical repair will be greater than that for
a mechanical repair; if $\beta_2$ is negative, the mean repair time for an electrical repair will be
less than that for a mechanical repair. Finally, if $\beta_2 = 0$, there is no difference in the mean
repair time between electrical and mechanical repairs and the type of repair is not related to
the repair time.

Using the estimated multiple regression equation y = .9305 + .3876x, + 1.2627x,, we 
see that .9305 is the estimate of 6,, .3876 is the estimate of B,, and 1.2627 is the estimate 
of B,. Thus, when x, = 0 (mechanical repair) 


¥ = .9305 + .3876x, (15.21) 


and when x, = 1 (electrical repair) 


9305 + .3876x, + 1.2627(1) (15.22) 
= 2.1932 + .3876x, 


y 


In effect, the use of a dummy variable for type of repair provides two estimated regression
equations that can be used to predict the repair time, one corresponding to mechanical
repairs and one corresponding to electrical repairs. In addition, with $b_2 = 1.2627$, we learn
that, on average, electrical repairs require 1.2627 hours longer than mechanical repairs.
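
The two estimated equations can be sketched in a few lines of Python. This is only an illustration: the function name `predict_repair_time` and the choice of four months since last service are assumptions made for the example, while the coefficients are the estimates quoted in the text.

```python
def predict_repair_time(months_since_service: float, electrical: bool) -> float:
    """Predicted repair time (hours) from y-hat = .9305 + .3876*x1 + 1.2627*x2."""
    b0, b1, b2 = 0.9305, 0.3876, 1.2627   # estimates quoted in the text
    x2 = 1 if electrical else 0           # dummy variable: 0 = mechanical, 1 = electrical
    return b0 + b1 * months_since_service + b2 * x2


# Hypothetical machine last serviced four months ago.
mechanical = predict_repair_time(4, electrical=False)   # equation (15.21)
electrical = predict_repair_time(4, electrical=True)    # equation (15.22)
gap = electrical - mechanical                           # equals b2 at any value of x1

print(f"mechanical: {mechanical:.4f}  electrical: {electrical:.4f}  gap: {gap:.4f}")
```

Whatever value of months since last service is used, the gap between the two predictions stays at $b_2$, which is why the two lines are parallel.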

Figure 15.10 is the plot of the Johnson data from Table 15.6. Repair time in hours
($y$) is represented by the vertical axis and months since last service ($x_1$) is represented
by the horizontal axis. A data point for a mechanical repair is indicated by an M and a
data point for an electrical repair is indicated by an E. Equations (15.21) and (15.22) are
plotted on the graph to show the two equations that can be used to predict
the repair time, one corresponding to mechanical repairs and one corresponding to
electrical repairs.


FIGURE 15.10 Scatter Diagram for the Johnson Filtration Repair Data from Table 15.6
[Scatter diagram not reproduced. Vertical axis: Repair Time (hours); horizontal axis: Months
Since Last Service ($x_1$), 0 to 10; M = mechanical repair, E = electrical repair.]


More Complex Categorical Variables


A categorical variable with k levels must be modeled using k − 1 dummy variables. Care must be
taken in defining and interpreting the dummy variables.

Because the categorical variable for the Johnson Filtration example had two levels (mechanical
and electrical), defining a dummy variable with zero indicating a mechanical repair
and one indicating an electrical repair was easy. However, when a categorical variable
has more than two levels, care must be taken in both defining and interpreting the dummy
variables. As we will show, if a categorical variable has k levels, k − 1 dummy variables are
required, with each dummy variable being coded as 0 or 1.

For example, suppose a manufacturer of copy machines organized the sales territories
for a particular state into three regions: A, B, and C. The managers want to use regression
analysis to help predict the number of copiers sold per week. With the number of
units sold as the dependent variable, they are considering several independent variables
(the number of sales personnel, advertising expenditures, and so on). Suppose the managers
believe sales region is also an important factor in predicting the number of copiers
sold. Because sales region is a categorical variable with three levels, A, B, and C, we
will need 3 − 1 = 2 dummy variables to represent the sales region. Each variable can be
coded 0 or 1 as follows.


$$x_1 = \begin{cases} 1 & \text{if sales region B} \\ 0 & \text{otherwise} \end{cases}$$

$$x_2 = \begin{cases} 1 & \text{if sales region C} \\ 0 & \text{otherwise} \end{cases}$$


With this definition, we have the following values of $x_1$ and $x_2$.


Region   $x_1$   $x_2$
A        0       0
B        1       0
C        0       1


Observations corresponding to region A would be coded $x_1 = 0$, $x_2 = 0$; observations
corresponding to region B would be coded $x_1 = 1$, $x_2 = 0$; and observations corresponding to
region C would be coded $x_1 = 0$, $x_2 = 1$.

The regression equation relating the expected value of the number of units sold, $E(y)$, to
the dummy variables would be written as

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

To help us interpret the parameters $\beta_0$, $\beta_1$, and $\beta_2$, consider the following three variations
of the regression equation.

$$E(y \mid \text{region A}) = \beta_0 + \beta_1(0) + \beta_2(0) = \beta_0$$
$$E(y \mid \text{region B}) = \beta_0 + \beta_1(1) + \beta_2(0) = \beta_0 + \beta_1$$
$$E(y \mid \text{region C}) = \beta_0 + \beta_1(0) + \beta_2(1) = \beta_0 + \beta_2$$

Thus, $\beta_0$ is the mean or expected value of sales for region A; $\beta_1$ is the difference between
the mean number of units sold in region B and the mean number of units sold in region A;
and $\beta_2$ is the difference between the mean number of units sold in region C and the mean
number of units sold in region A.

Two dummy variables were required because sales region is a categorical variable with
three levels. But the assignment of $x_1 = 0$, $x_2 = 0$ to indicate region A, $x_1 = 1$, $x_2 = 0$ to
indicate region B, and $x_1 = 0$, $x_2 = 1$ to indicate region C was arbitrary. For example, we could
have chosen $x_1 = 1$, $x_2 = 0$ to indicate region A, $x_1 = 0$, $x_2 = 0$ to indicate region B, and
$x_1 = 0$, $x_2 = 1$ to indicate region C. In that case, $\beta_1$ would have been interpreted as the mean
difference between regions A and B and $\beta_2$ as the mean difference between regions C and B.
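
The effect of choosing a different baseline region can be illustrated with a short sketch. The mean weekly sales figures below are hypothetical values invented for the example (the text gives no data for the copier regions), and `dummy_parameters` is an illustrative helper, not part of any library.

```python
def dummy_parameters(region_means: dict, baseline: str) -> dict:
    """Map mean sales per region to regression parameters for a given baseline.

    The baseline region is the one coded with all dummy variables equal to 0,
    so its mean is the intercept; every other parameter is a difference from it.
    """
    intercept = region_means[baseline]
    params = {"beta0": intercept}
    for region, mean in region_means.items():
        if region != baseline:
            params[f"{region}_minus_{baseline}"] = mean - intercept
    return params


# Hypothetical mean units sold per week in each region (illustration only).
hypothetical_means = {"A": 20.0, "B": 25.0, "C": 18.0}

params_a = dummy_parameters(hypothetical_means, baseline="A")
params_b = dummy_parameters(hypothetical_means, baseline="B")
print(params_a)
print(params_b)
```

Both codings describe the same three region means; only the meaning of each parameter changes with the baseline.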

The important point to remember is that when a categorical variable has k levels, k − 1
dummy variables are required in the multiple regression analysis. Thus, if the sales region
example had a fourth region, labeled D, three dummy variables would be necessary. For
example, the three dummy variables can be coded as follows.

$$x_1 = \begin{cases} 1 & \text{if sales region B} \\ 0 & \text{otherwise} \end{cases} \qquad
x_2 = \begin{cases} 1 & \text{if sales region C} \\ 0 & \text{otherwise} \end{cases} \qquad
x_3 = \begin{cases} 1 & \text{if sales region D} \\ 0 & \text{otherwise} \end{cases}$$
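
The k − 1 coding rule can be sketched as a small helper. The function name `encode_dummies` and its arguments are assumptions made for this illustration; the coding it produces for regions A through D follows the definitions above, with region A as the baseline.

```python
def encode_dummies(level, levels, baseline):
    """Return the k - 1 dummy values (0 or 1) for one observation.

    One dummy variable is created for every level except the baseline, in the
    order the levels are listed; the baseline is coded as all zeros.
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == other else 0 for other in levels if other != baseline]


regions = ["A", "B", "C", "D"]
codes = {region: encode_dummies(region, regions, baseline="A") for region in regions}

for region, (x1, x2, x3) in codes.items():
    print(f"region {region}: x1 = {x1}, x2 = {x2}, x3 = {x3}")
```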


EXERCISES 




Methods 

32. Consider a regression study involving a dependent variable $y$, a quantitative independent
variable $x_1$, and a categorical independent variable with two levels (level 1 and level 2).
a. Write a multiple regression equation relating $x_1$ and the categorical variable to $y$.

b. What is the expected value of y corresponding to level 1 of the categorical variable? 
c. What is the expected value of y corresponding to level 2 of the categorical variable? 
d. Interpret the parameters in your regression equation. 

33. Consider a regression study involving a dependent variable $y$, a quantitative independent
variable $x_1$, and a categorical independent variable with three possible levels (level 1,
level 2, and level 3).

a. How many dummy variables are required to represent the categorical variable?
b. Write a multiple regression equation relating $x_1$ and the categorical variable to $y$.
c. Interpret the parameters in your regression equation.


Applications 
34. Predicting Fast Food Sales. Management proposed the following regression model to
predict sales at a fast-food outlet.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$$

where

$x_1$ = number of competitors within one mile
$x_2$ = population within one mile (1000s)
$x_3$ = 1 if drive-up window present, 0 otherwise
$y$ = sales ($1000s)

The following estimated regression equation was developed after 20 outlets were
surveyed.

$$\hat{y} = 10.1 - 4.2x_1 + 6.8x_2 + 15.3x_3$$

a. What is the expected amount of sales attributable to the drive-up window?

b. Predict sales for a store with two competitors, a population of 8000 within one
mile, and no drive-up window.

c. Predict sales for a store with one competitor, a population of 3000 within one mile,
and a drive-up window.

35. Predicting Repair Time. Refer to the Johnson Filtration problem introduced in this section.
Suppose that in addition to information on the number of months since the machine
was serviced and whether a mechanical or an electrical repair was necessary, the managers
obtained a list showing which repairperson performed the service. The revised data follow.

DATA file: Repair


Repair Time   Months Since
in Hours      Last Service   Type of Repair   Repairperson
2.9           2              Electrical       Dave Newton
3.0           6              Mechanical       Dave Newton
4.8           8              Electrical       Bob Jones
1.8           3              Mechanical       Dave Newton
2.9           2              Electrical       Dave Newton
4.9           7              Electrical       Bob Jones
4.2           9              Mechanical       Bob Jones
4.8           8              Mechanical       Bob Jones
4.4           4              Electrical       Bob Jones
4.5           6              Electrical       Dave Newton


a. Ignore for now the months since the last maintenance service ($x_1$) and the repairperson
who performed the service. Develop the estimated simple linear regression
equation to predict the repair time ($y$) given the type of repair ($x_2$). Recall that
$x_2 = 0$ if the type of repair is mechanical and 1 if the type of repair is electrical.

b. Does the equation that you developed in part (a) provide a good fit for the observed
data? Explain.

c. Ignore for now the months since the last maintenance service and the type of repair
associated with the machine. Develop the estimated simple linear regression equation
to predict the repair time given the repairperson who performed the service. Let
$x_3 = 0$ if Bob Jones performed the service and $x_3 = 1$ if Dave Newton performed the
service.

d. Does the equation that you developed in part (c) provide a good fit for the observed 
data? Explain. 

36. Model Extension for Predicting Repair Time. This problem is an extension of the 

situation described in exercise 35. 

a. Develop the estimated regression equation to predict the repair time given the 
number of months since the last maintenance service, the type of repair, and the 
repairperson who performed the service. 

b. At the .05 level of significance, test whether the estimated regression equation 
developed in part (a) represents a significant relationship between the independent 
variables and the dependent variable. 

c. Is the addition of the independent variable $x_3$, the repairperson who performed the
service, statistically significant? Use $\alpha = .05$. What explanation can you give for
the results observed?

37. Pricing Refrigerators. Best Buy, a nationwide retailer of electronics, computers, and
appliances, sells several brands of refrigerators. A random sample of models of full-size
refrigerators sold by Best Buy and the corresponding cubic feet (cu. ft.) and
list price follow (Best Buy website).

DATA file: RefrigeratorSizePrice

Model  Cu. Ft.  List Price
Frigidaire Gallery Custom-Flex Top-Freezer Refrigerator  18.3  $899.99
GE French Door Refrigerator  24.8  $1599.99
GE Frost-Free Side-by-Side Refrigerator with Thru-the-Door Ice and Water  25.4  $1599.99
Whirlpool Top-Freezer Refrigerator  19.3  $749.99
GE Frost-Free Top-Freezer Refrigerator  1S  $599.99
Whirlpool French Door Refrigerator with Thru-the-Door Ice and Water  19.6  $1619.99
Samsung French Door Refrigerator  25.0  $999.99
Samsung Side-by-Side Refrigerator  24.5  $1299.99
Whirlpool Side-by-Side Refrigerator with Thru-the-Door Ice and Water  25.4  $1299.99
Frigidaire Gallery Frost-Free Side-by-Side Refrigerator with Thru-the-Door Ice and Water  26.0  $1299.99
Frigidaire Side-by-Side Refrigerator with Thru-the-Door Ice and Water  25.6  $1099.99
Frigidaire Top-Freezer Refrigerator  18.0  $579.99
Whirlpool French Door Refrigerator with Thru-the-Door Ice and Water  25.0  $2199.99
Whirlpool Top-Freezer Refrigerator  20.5  $849.99
GE Frost-Free Top-Freezer Refrigerator  15.5  $549.99
Samsung 4-Door French Door Refrigerator with Thru-the-Door Ice and Water  28.2  $2599.99
Samsung Showcase 4-Door French Door Refrigerator  27.8  $2999.99
Samsung 3-Door French Door Refrigerator with Thru-the-Door Ice and Water  24.6  $2399.99
Frigidaire Side-by-Side Refrigerator with Thru-the-Door Ice and Water  22.6  $1099.99
GE Side-by-Side Refrigerator with Thru-the-Door Ice and Water  21.8  $1499.99
GE Bottom-Freezer Refrigerator  20.9  $1649.99


a. Develop the estimated simple linear regression equation to show how list price is 
related to the independent variable cubic feet. 

b. At the .05 level of significance, test whether the estimated regression equation 
developed in part (a) indicates a significant relationship between list price and 
cubic feet. 

c. Develop a dummy variable that will account for whether the refrigerator has 
the thru-the-door ice and water feature. Code the dummy variable with a value 
of 1 if the refrigerator has the thru-the-door ice and water feature and with 
0 otherwise. Use this dummy variable to develop the estimated multiple regres- 
sion equation to show how list price is related to cubic feet and the thru-the-door 
ice and water feature. 

d. At $\alpha = .05$, is the thru-the-door ice and water feature a significant factor in the list
price of a refrigerator?

38. Risk of a Stroke. A 10-year study conducted by the American Heart Association
provided data on how age, blood pressure, and smoking relate to the risk of strokes.
Assume that the following data are from a portion of this study. Risk is interpreted
as the probability (times 100) that the patient will have a stroke over the next 10-year
period. For the smoker variable, 1 indicates a smoker and 0 indicates a nonsmoker.

DATA file: Stroke

Risk   Age   Blood Pressure   Smoker
12     57    152              0
24     67    163              0
13     58    155              0
56     86    177              1
28     59    196              0
51     76    189              1
18     56    155              1
31     78    120              0
37     80    135              1
15     78    98               0
22     Val   152              0
36     70    173              1
15     67    135              1
48     Wit   209              1
15     60    199              0
36     82    Wale             1
8      66    166              0
34     80    125              1
3      62    WZ               0
37     59    207              1


a. Develop an estimated regression equation that relates risk of a stroke to the person’s 
age, blood pressure, and whether the person is a smoker. 

b. Is smoking a significant factor in the risk of a stroke? Explain. Use $\alpha = .05$.

c. What is the probability of a stroke over the next 10 years for Art Speen, a 68-year- 
old smoker who has blood pressure of 175? What action might the physician 
recommend for this patient? 


15.8 Residual Analysis 


In Chapter 14 we showed how a residual plot against the independent variable $x$ can be
used to validate the assumptions for a simple linear regression model. Because multiple
regression analysis deals with two or more independent variables, we would have to examine
a residual plot against each of the independent variables to use this approach. The more
common approach in multiple regression analysis is to develop a residual plot against the
predicted values $\hat{y}$.


Residual Plot Against $\hat{y}$


A residual plot against the predicted values $\hat{y}$ represents the predicted value of the dependent
variable $\hat{y}$ on the horizontal axis and the residual values on the vertical axis. A point is
plotted for each residual. The first coordinate for each point is given by $\hat{y}_i$ and the second
coordinate is given by the corresponding value of the ith residual $y_i - \hat{y}_i$. For the Butler
Trucking multiple regression example, the estimated regression equation that we developed
using Excel (see Figure 15.5) is

$$\hat{y} = -.8687 + .0611x_1 + .9234x_2$$

where $x_1$ = miles traveled and $x_2$ = number of deliveries. Table 15.7 shows the predicted
values and residuals based on this equation. The residual plot against $\hat{y}$ for Butler Trucking
is shown in Figure 15.11. The residual plot does not indicate any abnormalities.

TABLE 15.7 Predicted Values and Residuals for Butler Trucking

Miles Traveled   Deliveries   Travel Time   Predicted Time   Residual
($x_1$)          ($x_2$)      ($y$)         ($\hat{y}$)      ($y - \hat{y}$)
100              4            9.3           8.9385            0.3615
50               3            4.8           4.9583           -0.1583
100              4            8.9           8.9385           -0.0385
100              2            6.5           7.0916           -0.5916
50               2            4.2           4.0349            0.1651
80               2            6.2           5.8689            0.3311
75               3            7.4           6.4867            0.9133
65               4            6.0           6.7987           -0.7987
90               3            7.6           7.4037            0.1963
90               2            6.1           6.4803           -0.3803


FIGURE 15.11 Residual Plot Against the Predicted Time $\hat{y}$ for Butler Trucking
[Residual plot not reproduced. Horizontal axis: Predicted Time.]


A residual plot against $\hat{y}$ can also be used when performing residual analysis in simple
linear regression. In fact, the pattern for a residual plot against $\hat{y}$ in simple linear regression
is the same as the pattern of the residual plot against $x$. Thus, for simple linear regression
the residual plot against $\hat{y}$ and the residual plot against $x$ provide the same information. In
multiple regression analysis, however, it is preferable to use the residual plot against $\hat{y}$ to
determine whether the model’s assumptions are satisfied.
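
The predicted values and residuals in Table 15.7 can be reproduced outside Excel. The sketch below is one possible approach, not the procedure used in the text: it fits the least squares coefficients by solving the normal equations with plain Python, using the Butler Trucking observations listed in Table 15.7. The helper names `solve` and `least_squares` are illustrative.

```python
# Butler Trucking observations as listed in Table 15.7.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
travel_time = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(columns, y):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


b0, b1, b2 = least_squares([miles, deliveries], travel_time)
predicted = [b0 + b1 * m + b2 * d for m, d in zip(miles, deliveries)]
residuals = [y - y_hat for y, y_hat in zip(travel_time, predicted)]

for y_hat, e in zip(predicted, residuals):
    print(f"predicted = {y_hat:7.4f}   residual = {e:7.4f}")
```

Plotting `residuals` against `predicted` gives the residual plot against $\hat{y}$ described in this section.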


Standardized Residual Plot Against $\hat{y}$


In Chapter 14 we pointed out that standardized residuals were frequently used in residual
plots. We showed how to construct a standardized residual plot against $x$ and discussed how
the standardized residual plot could be used to identify outliers and provide insight about
the assumption that the error term $\epsilon$ has a normal distribution. Recall that we recommended
considering any observation with a standardized residual of less than −2 or greater than
+2 as an outlier. With normally distributed errors, standardized residuals should be outside
these limits approximately 5% of the time.

In multiple regression analysis, the computation of the standardized residuals is too
complex to be done by hand. As we showed in Section 14.8, Excel’s Regression tool can
be used to compute the standard residuals. In multiple regression analysis we use the same
procedure to compute the standard residuals. Instead of developing a standardized residual
plot against each of the independent variables, we will construct one standardized residual
plot against the predicted values $\hat{y}$.

FIGURE 15.12 Standardized Residual Plot Against the Predicted Values $\hat{y}$ for Butler Trucking
[Excel worksheet and chart not reproduced. Note: Rows 1–34 are hidden.]

Figure 15.12 shows the standard residuals and the corresponding standardized residual
plot against $\hat{y}$ (Predicted Time) for Butler Trucking developed using Excel’s Regression
tool and Excel’s chart tools. The standardized residual plot does not indicate any abnormalities,
and no standard residual is less than −2 or greater than +2. Note that the pattern
of the standardized residual plot against $\hat{y}$ is the same as the pattern of the residual plot
against $\hat{y}$ shown in Figure 15.11. But the standardized residual plot is preferred because it
enables us to check for outliers and determine whether the assumption of normality for the
regression model is reasonable.
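
The ±2 screening rule can be sketched with the residuals from Table 15.7. The standardization used here is a simplifying assumption made for the illustration: each residual is divided by the sample standard deviation of the residuals. Standardized residuals that adjust for the leverage of each observation, as described in Chapter 14, will differ somewhat, so the numbers below should not be read as the values in Figure 15.12.

```python
import statistics

# Residuals for Butler Trucking as listed in Table 15.7.
residuals = [0.3615, -0.1583, -0.0385, -0.5916, 0.1651,
             0.3311, 0.9133, -0.7987, 0.1963, -0.3803]

# Simplified scale (an assumption of this sketch): sample standard deviation
# of the residuals, with no leverage adjustment.
scale = statistics.stdev(residuals)
standardized = [e / scale for e in residuals]

# Flag observations outside the -2 to +2 band as potential outliers.
outliers = [(i + 1, round(z, 2)) for i, z in enumerate(standardized) if abs(z) > 2]

for i, z in enumerate(standardized, start=1):
    print(f"observation {i:2d}: standardized residual = {z:6.2f}")
print("potential outliers:", outliers)
```

Under this simplified scaling no observation falls outside the ±2 band, which agrees with the conclusion stated for Figure 15.12.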


EXERCISES 


Methods 
39. Data for two variables, $x$ and $y$, follow.

$x_i$   1   2   3   4    5
$y_i$   3   7   5   11   14

a. Develop the estimated regression equation for these data.

b. Plot the residuals against $\hat{y}$. Does the residual plot support the assumptions about $\epsilon$?
Explain.

c. Plot the standardized residuals against $\hat{y}$. Do any outliers appear in these data? Explain.

40. Data for two variables, $x$ and $y$, follow.

$x_i$   22   24   26   28   40
$y_i$   12   21   31   35   70

a. Develop the estimated regression equation for these data.

b. Compute the standardized residuals for these data. Can any of these observations be
classified as an outlier? Explain.

c. Develop a standardized residual plot against $\hat{y}$. Does the residual plot support the
assumptions about $\epsilon$? Explain.


Applications 

41. Detecting Outliers in Theater Revenue. Exercise 5 gave the following data on
weekly gross revenue ($1000s), television advertising expenditures ($1000s), and
newspaper advertising expenditures ($1000s) for Showtime Movie Theaters.

DATA file: Showtime

Weekly Gross Revenue   Television Advertising   Newspaper Advertising
($1000s)               ($1000s)                 ($1000s)
96                     5.0                      AES
7                      n                        =
99                     2                        DS
25                     3.0                      3.)
94                     39                       23
94                     25                       4.2
94                     3.0                      25

a. Find an estimated regression equation relating weekly gross revenue to television
advertising expenditures and newspaper advertising expenditures.
b. Plot the standardized residuals against $\hat{y}$. Does the residual plot support the
assumptions about $\epsilon$? Explain.
c. Check for any outliers in these data. What are your conclusions?
42. Predicting Sports Car Prices. The following table reports the price, horsepower, and
1/4-mile speed for 16 popular sports and GT cars.

DATA file: Auto2

                                 Price      Curb Weight                Speed at 1/4 Mile
Sports & GT Car                  ($1000s)   (lb.)         Horsepower   (mph)
Acura Integra Type R             25.035     257           195          90.7
Acura NSX-T                      93.758     3066          290          108.0
BMW Z3 2.8                       40.900     2844          189          ere
Chevrolet Camaro Z28             24.865     3439          305          103.2
Chevrolet Corvette Convertible   50.144     3246          345          102.1
Dodge Viper RT/10                69.742     3319          450          12
Ford Mustang GT                  23.200     3227          225          Sle
Honda Prelude Type SH            26.382     3042          195          89.7
Mercedes-Benz CLK320             44.988     3240          ZnS          93.0
Mercedes-Benz SLK230             42.762     3025          185          92S
Mitsubishi 3000GT VR-4           47.518     S737.         320          99.0
Nissan 240SX SE                  25.066     2862          155          84.6
Pontiac Firebird Trans Am        27.770     3455          305          103.2
Porsche Boxster                  45.560     2822          201          F522
Toyota Supra Turbo               40.989     3505          320          105.0
Volvo C70                        41.120     3285          236          97.0

a. Find the estimated regression equation, which uses price and horsepower to predict
1/4-mile speed.
b. Plot the standardized residuals against $\hat{y}$. Does the residual plot support the
assumption about $\epsilon$? Explain.
c. Check for any outliers. What are your conclusions?
43. Predicting Golf Scores. The Ladies Professional Golfers Association (LPGA)
maintains statistics on performance and earnings for members of the LPGA Tour. Year-end
performance statistics for 134 golfers for 2014 appear in the file 2014LPGAStats
(LPGA website). Earnings ($1000s) is the total earnings in thousands of dollars; Scoring
Avg. is the scoring average for all events; Greens in Reg. is the percentage of time
a player is able to hit the greens in regulation; and Putting Avg. is the average number
of putts taken on greens hit in regulation. A green is considered hit in regulation if any
part of the ball is touching the putting surface and the difference between par for the
hole and the number of strokes taken to hit the green is at least 2.

DATA file: LPGA2014

a. Develop an estimated regression equation that can be used to predict the scoring
average given the percentage of time a player is able to hit the greens in regulation
and the average number of putts taken on greens hit in regulation.

b. Plot the standardized residuals against $\hat{y}$. Does the residual plot support the
assumption about $\epsilon$? Explain.

c. Check for any outliers. What are your conclusions?

d. Are there any influential observations? Explain.


15.9 Practical Advice: Big Data and Hypothesis 
Testing in Multiple Regression 


In Chapter 14, we observed that in simple linear regression, the p-value for the test of the
hypothesis $H_0\colon \beta_1 = 0$ decreases as the sample size increases. Likewise, for a given level
of confidence, the confidence interval for $\beta_1$, the confidence interval for the mean value of
$y$, and the prediction interval for an individual value of $y$ each narrows as the sample size
increases. These results extend to multiple regression. As the sample size increases:


the p-value for the F test used to determine whether a significant relationship exists 
between the dependent variable and the set of all independent variables in the regres- 
sion model decreases; 

the p-value for each of the t tests used to determine whether a significant relationship
exists between the dependent variable and an individual independent variable in the
regression model decreases;

the confidence interval for the slope parameter associated with each individual inde- 
pendent variable narrows; 

the confidence interval for the mean value of y narrows; 

the prediction interval for an individual value of y narrows. 


Thus, the interval estimates for the slope parameter associated with each individual 
independent variable, the mean value of y, and predicted individual value of y will become 
more precise as the sample size increases. And we are more likely to reject the hypo- 
thesis that a relationship does not exist between the dependent variable and the set of all 
individual independent variables in the model as the sample size increases. And for each 
individual independent variable, we are more likely to reject the hypothesis that a relation- 
ship does not exist between the dependent variable and the individual independent variable 
as the sample size increases. Even when severe multicollinearity is present, if the sample 
is sufficiently large, independent variables that are highly correlated may each have a 
significant relationship with the dependent variable. But this does not necessarily mean that 
these results become more reliable as the sample size increases. 

No matter how large the sample used to estimate the multiple regression model, we
must be concerned about the potential presence of nonsampling error in the data. It is
important to carefully consider whether a random sample of the population of interest has
actually been taken. If nonsampling error is introduced in the data collection process, the
likelihood of making a Type I or Type II error on hypothesis tests in multiple regression
may be higher than if the sample data are free of nonsampling error. Furthermore,
multicollinearity may cause the estimated slope coefficients to be misleading; this problem persists
as the size of the sample used to estimate the multiple regression model increases. Finally,
it is important to consider whether the statistically significant relationship(s) in the multiple
regression model are of practical significance.

Although multiple regression is an extremely powerful statistical tool, no business 
decision should be based exclusively on hypothesis testing in multiple regression. Non- 
sampling error may lead to misleading results. If severe multicollinearity is present, we 
must be cautious in interpreting the estimated slope coefficients. And practical significance 
should always be considered in conjunction with statistical significance; this is particularly 
important when a hypothesis test is based on an extremely large sample, because p-values 
in such cases can be extremely small. When executed properly, hypothesis tests in multiple 
regression provide evidence that should be considered in combination with information 
collected from other sources to make the most informed decision possible. 
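
The gap between statistical and practical significance in very large samples can be illustrated numerically. The sketch below uses the simple linear regression relationship between the standard error of the slope, the residual standard deviation, and the spread of $x$; the slope of 0.02, the standard deviations of 1.0, and the sample sizes are hypothetical values chosen only to make the arithmetic transparent.

```python
import math


def slope_t_statistic(slope: float, residual_sd: float, x_sd: float, n: int) -> float:
    """t statistic for a slope in simple linear regression.

    Uses s_b1 = s / sqrt(sum((x - x_bar)^2)), where the sum of squares equals
    (n - 1) times the sample variance of x.
    """
    standard_error = residual_sd / (x_sd * math.sqrt(n - 1))
    return slope / standard_error


# Hypothetical, practically negligible slope held fixed while n grows.
t_values = {n: slope_t_statistic(0.02, 1.0, 1.0, n) for n in (101, 10_001, 1_000_001)}

for n, t in t_values.items():
    print(f"n = {n:>9,d}   t = {t:6.2f}")
```

The estimated slope is the same in every row; only the sample size changes. With a million observations the t statistic is far beyond any conventional critical value even though the effect itself may be too small to matter for a business decision.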


SUMMARY 

In this chapter, we introduced multiple regression analysis as an extension of simple linear
regression analysis presented in Chapter 14. Multiple regression analysis enables us to
understand how a dependent variable is related to two or more independent variables. The
regression equation $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ shows that the expected value
or mean value of the dependent variable $y$ is related to the values of the independent variables
$x_1, x_2, \ldots, x_p$. Sample data and the least squares method are used to develop the estimated
regression equation $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$. In effect, $b_0, b_1, b_2, \ldots, b_p$
are sample statistics used to estimate the unknown model parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
Excel output was used throughout the chapter to emphasize the fact that computer software
packages are the only realistic means of performing the numerous computations required in
multiple regression analysis.

The multiple coefficient of determination was presented as a measure of the goodness of 
fit of the estimated regression equation. It determines the proportion of the variation of y that 
can be explained by the estimated regression equation. The adjusted multiple coefficient of de- 
termination is a similar measure of goodness of fit that adjusts for the number of independent 
variables and thus avoids overestimating the impact of adding more independent variables. 

An F test and a t test were presented as ways to determine statistically whether the
relationship among the variables is significant. The F test is used to determine whether
there is a significant overall relationship between the dependent variable and the set of all
independent variables. The t test is used to determine whether there is a significant
relationship between the dependent variable and an individual independent variable given the
other independent variables in the regression model. Correlation among the independent
variables, known as multicollinearity, was discussed.

The section on categorical independent variables showed how dummy variables can be 
used to incorporate categorical data into multiple regression analysis. The section on resid- 
ual analysis showed how residual analysis can be used to validate the model assumptions 
and detect outliers. Finally, we discussed the implications of large data sets on the applica- 
tion and interpretation of multiple regression analysis. 


GLOSSARY 




Adjusted multiple coefficient of determination A measure of the goodness of fit of the
estimated multiple regression equation that adjusts for the number of independent variables in
the model and thus avoids overestimating the impact of adding more independent variables.

Categorical independent variable An independent variable with categorical data.

Dummy variable A variable used to model the effect of categorical independent variables.
A dummy variable may take only the value zero or one.

Estimated multiple regression equation The estimate of the multiple regression equation
based on sample data and the least squares method; it is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$.

Least squares method The method used to develop the estimated regression equation. It
minimizes the sum of squared residuals (the deviations between the observed values of the
dependent variable, $y_i$, and the estimated values of the dependent variable, $\hat{y}_i$).

Multicollinearity The term used to describe the correlation among the independent variables.

Multiple coefficient of determination A measure of the goodness of fit of the estimated
multiple regression equation. It can be interpreted as the proportion of the variability in the
dependent variable that is explained by the estimated regression equation.

Multiple regression analysis Regression analysis involving two or more independent
variables.

Multiple regression equation The mathematical equation relating the expected value or
mean value of the dependent variable to the values of the independent variables; that is,
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.

Multiple regression model The mathematical equation that describes how the dependent
variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_p$ and an error term $\epsilon$.


KEY FORMULAS

Multiple Regression Model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{15.1}$$

Multiple Regression Equation

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \tag{15.2}$$

Estimated Multiple Regression Equation

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p \tag{15.3}$$

Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2 \tag{15.4}$$

Relationship Among SST, SSR, and SSE

$$\text{SST} = \text{SSR} + \text{SSE} \tag{15.7}$$

Multiple Coefficient of Determination

$$R^2 = \frac{\text{SSR}}{\text{SST}} \tag{15.8}$$

Adjusted Multiple Coefficient of Determination

$$R_a^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \tag{15.9}$$

Mean Square Due to Regression

$$\text{MSR} = \frac{\text{SSR}}{p} \tag{15.12}$$

Mean Square Due to Error

$$\text{MSE} = \frac{\text{SSE}}{n - p - 1} \tag{15.13}$$

F Test Statistic

$$F = \frac{\text{MSR}}{\text{MSE}} \tag{15.14}$$

t Test Statistic

$$t = \frac{b_i}{s_{b_i}} \tag{15.15}$$
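
As a worked illustration of formulas (15.7) through (15.14), the sketch below evaluates them for the Butler Trucking example, using the observed travel times and the residuals listed in Table 15.7 with n = 10 observations and p = 2 independent variables. The variable names are illustrative, and the results are subject to the rounding of the tabled residuals.

```python
# Observed travel times and residuals for Butler Trucking (Table 15.7).
travel_time = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]
residuals = [0.3615, -0.1583, -0.0385, -0.5916, 0.1651,
             0.3311, 0.9133, -0.7987, 0.1963, -0.3803]

n = len(travel_time)   # number of observations
p = 2                  # independent variables: miles traveled, deliveries

mean_y = sum(travel_time) / n
sst = sum((y - mean_y) ** 2 for y in travel_time)   # total sum of squares
sse = sum(e ** 2 for e in residuals)                # sum of squares due to error
ssr = sst - sse                                     # (15.7): SST = SSR + SSE

r_squared = ssr / sst                                            # (15.8)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)      # (15.9)
msr = ssr / p                                                    # (15.12)
mse = sse / (n - p - 1)                                          # (15.13)
f_statistic = msr / mse                                          # (15.14)

print(f"SST = {sst:.4f}  SSR = {ssr:.4f}  SSE = {sse:.4f}")
print(f"R^2 = {r_squared:.4f}  adjusted R^2 = {adj_r_squared:.4f}")
print(f"MSR = {msr:.4f}  MSE = {mse:.4f}  F = {f_statistic:.2f}")
```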


SUPPLEMENTARY EXERCISES 


44. Predicting College Grade Point Average. The admissions officer for Clearwater College
developed the following estimated regression equation relating the final college
GPA to the student’s SAT mathematics score and high-school GPA.

$$\hat{y} = -1.41 + .0235x_1 + .00486x_2$$

where

$x_1$ = high-school grade point average
$x_2$ = SAT mathematics score
$y$ = final college grade point average

a. Interpret the coefficients in this estimated regression equation.
b. Predict the final college GPA for a student who has a high-school average of 84 and
a score of 540 on the SAT mathematics test.
45. Predicting Job Satisfaction. The personnel director for Electronics Associates developed
the following estimated regression equation relating an employee’s score on a
job satisfaction test to his or her length of service and wage rate.

$$\hat{y} = 14.4 - 8.69x_1 + 13.5x_2$$

where

$x_1$ = length of service (years)
$x_2$ = wage rate (dollars)
$y$ = job satisfaction test score (higher scores
indicate greater job satisfaction)

a. Interpret the coefficients in this estimated regression equation.
b. Predict the job satisfaction test score for an employee who has four years of service
and makes $6.50 per hour.


46. A partial computer output from a regression analysis using Excel’s Regression tool follows.

[Excel Regression tool output not reproduced.]

a. Compute the missing entries in this output.
b. Using $\alpha = .05$, test for overall significance.
c. Use the t test and $\alpha = .05$ to test $H_0\colon \beta_1 = 0$ and $H_0\colon \beta_2 = 0$.

47. Analyzing College Grade Point Average. Recall that in exercise 44, the admissions
officer for Clearwater College developed the following estimated regression equation 
relating final college GPA to the student’s SAT mathematics score and high-school GPA. 


$\hat{y} = -1.41 + .0235x_1 + .00486x_2$

where

$x_1$ = high-school grade point average
$x_2$ = SAT mathematics score
$y$ = final college grade point average


A portion of the Excel Regression tool output follows. 


a. Complete the missing entries in this output. 
b. Using $\alpha = .05$, test for overall significance.
c. Did the estimated regression equation provide a good fit to the data? Explain.
d. Use the t test and $\alpha = .05$ to test $H_0$: $\beta_1 = 0$ and $H_0$: $\beta_2 = 0$.

48. Analyzing Job Satisfaction. Recall that in exercise 45 the personnel director for Elec- 
tronics Associates developed the following estimated regression equation relating an 
employee’s score on a job satisfaction test to length of service and wage rate. 


$\hat{y} = 14.4 - 8.69x_1 + 13.5x_2$

where

$x_1$ = length of service (years)
$x_2$ = wage rate (dollars)
$y$ = job satisfaction test score (higher scores
indicate greater job satisfaction)


a. Complete the missing entries in this output. 
b. Using $\alpha = .05$, test for overall significance.
c. Did the estimated regression equation provide a good fit to the data? Explain.
d. Use the t test and $\alpha = .05$ to test $H_0$: $\beta_1 = 0$ and $H_0$: $\beta_2 = 0$.

[Excel Regression tool output not reproduced]

49. Gift Card Sales. For the holiday season of 2017, nearly 59 percent of consumers
planned to buy gift cards. According to the National Retail Federation, millennials
like to purchase gift cards (Dayton Daily News website). Consider the sample data in
the file GiftCards. The following data are given for a sample of 600 millennials: the
amount they reported spending on gift cards over the last year, annual income, marital
status (1 = yes, 0 = no), and whether they are male (1 = yes, 0 = no).
a. Develop an estimated regression equation that predicts annual spend on gift cards
given annual income, marital status, and gender.
b. Test the overall significance at the .05 level.
c. Test the significance of each individual variable using a .05 level of significance.

50. Zoo Attendance. The Cincinnati Zoo and Botanical Gardens had a record attendance
of 1.87 million visitors in 2017 (Cincinnati Business Courier website). Nonprofit
organizations such as zoos and museums are becoming more sophisticated in their use
of data to improve the customer experience. Being able to better estimate expected
revenue is one use of analytics that allows nonprofits to better manage their operations.
The file ZooSpend contains sample data on zoo attendance. The file contains the
following data on 125 visits by families to a zoo: amount spent, size of the family, the
distance the family lives from the zoo (the gate attendee asks for the zip code of each
family entering the zoo), and whether or not the family has a zoo membership
(1 = yes, 0 = no).
a. Develop an estimated regression equation that predicts the amount of money spent
by a family given family size, whether or not it has a zoo membership, and the
distance the family lives from the zoo.
b. Test the significance of the zoo membership independent variable at the .05 level.
c. Give an explanation for the sign of the estimate you tested in part (b).
d. Test the overall significance of the model at the .05 level.
e. Estimate the amount of money spent in a visit by a family of five that lives
125 miles from the zoo and does not have a zoo membership.

51. Analyzing Repeat Purchases. The Tire Rack, an online distributor of tires and
wheels, conducts extensive testing to provide customers with products that are 
right for their vehicle, driving style, and driving conditions. In addition, The 
Tire Rack maintains an independent consumer survey to help drivers help each 
other by sharing their long-term tire experiences (The Tire Rack website). The 
following data show survey ratings (1 to 10 scale with 10 the highest rating) for 
18 high-performance all-season tires. The variable Tread Wear rates quickness of 
wear based on the driver’s expectations, the variable Dry Traction rates the grip 
of a tire on a dry road, the variable Steering rates the tire's steering responsiveness,
and the variable Buy Again rates the driver's desire to purchase the same tire again.

DATA file: AllSeasonTires

Tire                                  Tread Wear    Dry Traction  Steering      Buy Again
Sumitomo HTR A/S P02                  8.6           9.0           8.5           8.0
Goodyear Eagle Sport All-Season       8.5           9.3           8.8           [illegible]
Michelin Pilot Sport A/S 3            [illegible]   [illegible]   8.9           7.2
Kumho Ecsta PA31                      7.2           8.4           [illegible]   6.4
Firestone Firehawk Wide Oval AS       7.4           7.6           [illegible]   5.8
BFGoodrich g-Force Super Sport A/S    [illegible]   8.7           8.3           [illegible]
Yokohama AVID ENVigor                 6.4           8.7           8.4           5.8
Goodyear Eagle RS-A2                  6.6           8.4           [illegible]   5.4
Hankook Ventus V2 concept2            [illegible]   8.0           [illegible]   [illegible]
Firestone Firehawk GT                 6.1           8.2           [illegible]   4.4
Firestone FR740                       [illegible]   7.8           [illegible]   3.2
Dunlop SP Sport 7000 A/S              5.0           7.6           7.0           3.0
Goodyear Eagle RS-A                   5.6           7.4           7.0           3.6
Bridgestone Potenza RE92A             5.7           [illegible]   6.5           2.6
Yokohama ADVAN A83A                   4.4           8.0           7.6           1.0
Bridgestone Potenza RE92              5.4           6.9           6.3           2.5
Firestone Firehawk GTA-03             3.8           [illegible]   6.9           2.0


a. Develop an estimated simple linear regression equation that can be used to predict 
the Buy Again rating given the Tread Wear rating. At the .01 level of significance, 
test for a significant relationship. Does this estimated regression equation provide a 
good fit to the data? Explain. 

b. Develop an estimated multiple regression equation that can be used to predict the 
Buy Again rating given the Tread Wear rating and the Dry Traction rating. Is the 
addition of the Dry Traction independent variable significant at $\alpha = .01$? Explain.

c. Develop an estimated multiple regression equation that can be used to predict the 
Buy Again rating given the Tread Wear rating, the Dry Traction rating, and the 
Steering rating. Is the addition of the Steering independent variable significant at 
$\alpha = .01$? Explain.

52. Analyzing Winning Percentage in the NBA. The National Basketball Association 
(NBA) records a variety of statistics for each team. Six of these statistics are the
percentage of games won (Win%), the percentage of field goals made (FG%), the per- 
centage of three-point shots made (3P%), the percentage of free throws made (FT%), 
the average number of offensive rebounds per game (RBOff), and the average number 
of defensive rebounds per game (RBDef). The data contained in the file NBAStats 
show the values of these statistics for the 30 teams in the NBA for one full season. 

A portion of the data follows. 
Team          Win%          FG%    3P%    FT%           RBOff         RBDef
Atlanta       60.6          45.4   37.0   74.0          [illegible]   [illegible]
[illegible]   [illegible]   46.0   36.7   77.8          [illegible]   [illegible]
Toronto       34.8          44.0   34.0   77.0          10.6          31.4
Utah          54.5          45.6   32.9   75.4          13.0          [illegible]
Washington    30.3          44.1   32.0   [illegible]   [illegible]   29.9

a. Develop an estimated regression equation that can be used to predict the percentage 
of games won given the percentage of field goals made. At the .05 level of signifi- 
cance, test for a significant relationship. 

b. Provide an interpretation for the slope of the estimated regression equation developed 
in part (a). 

c. Develop an estimated regression equation that can be used to predict the percentage 
of games won given the percentage of field goals made, the percentage of three-point 
shots made, the percentage of free throws made, the average number of offensive re- 
bounds per game, and the average number of defensive rebounds per game (RBDef). 

d. For the estimated regression equation developed in part (c), remove any independent 
variables that are not significant at the .05 level of significance and develop a new 
estimated regression equation using the remaining independent variables. 

e. Assuming the estimated regression equation developed in part (d) can be used for 
the 2012-2013 season, predict the percentage of games won for a team with the 
following values for the four independent variables: FG% = 45, 3P% = 35, 
RBOff = 12, and RBDef = 30. 

CASE PROBLEM 1: CONSUMER RESEARCH, INC. 
Consumer Research, Inc., is an independent agency that conducts research on consumer atti- 
tudes and behaviors for a variety of firms. In one study, a client asked for an investigation of 
consumer characteristics that can be used to predict the amount charged by credit card users. 
Data were collected on annual income, household size, and annual credit card charges for a 
sample of 50 consumers. The following data are contained in the file Consumer. 
Income     Household     Amount        Income     Household   Amount
($1000s)   Size          Charged ($)   ($1000s)   Size        Charged ($)
54         3             4016          54         6           5573
30         2             3152          30         1           2583
32         4             5100          48         2           3866
50         5             4742          34         5           3586
31         2             1864          67         4           5037
55         2             4070          50         2           3605
37         1             2731          67         5           5345
40         2             3348          55         6           5370
66         4             4764          52         2           3890
51         3             4110          62         3           4705
25         3             4208          64         2           4157
48         4             4219          22         3           8579
27         1             2477          29         4           3890
33         2             2514          39         2           2972
65         3             4214          35         1           3121
63         4             4965          39         4           4183
42         6             4412          54         3           3730
21         2             2448          23         6           4127
44         1             2995          27         2           2921
37         5             4171          26         7           4603
62         6             5678          61         2           4273
21         3             3623          30         2           3067
55         7             5301          22         4           3074
42         2             3020          46         5           4820
41         [illegible]   4828          66         4           5149

Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.


Managerial Report 

1. Use methods of descriptive statistics to summarize the data. Comment on the findings. 

2. Develop estimated regression equations, first using annual income as the independent 
variable and then using household size as the independent variable. Which variable is 
the better predictor of annual credit card charges? Discuss your findings. 

3. Develop an estimated regression equation with annual income and household size as the 
independent variables. Discuss your findings. 

4. What is the predicted annual credit card charge for a three-person household with an 
annual income of $40,000? 

5. Discuss the need for other independent variables that could be added to the model. 
What additional variables might be helpful? 


CASE PROBLEM 2: PREDICTING WINNINGS FOR 


His win was no surprise because for the 2011 season he finished fourth in the point standings 
with 2330 points, behind Tony Stewart (2403 points), Carl Edwards (2403 points), and Kevin 
Harvick (2345 points). In 2011 he earned $6,183,580 by winning three Poles (fastest driver 
in qualifying), winning three races, finishing in the top five 12 times, and finishing in the top 
ten 20 times. NASCAR’s point system in 2011 allocated 43 points to the driver who finished 
first, 42 points to the driver who finished second, and so on down to 1 point for the driver 
who finished in the 43rd position. In addition, any driver who led a lap received 1 bonus point,
the driver who led the most laps received an additional bonus point, and the race winner was 
awarded 3 bonus points. But the maximum number of points a driver could earn in any race 
was 48. Table 15.8 shows data for the 2011 season for the top 35 drivers (NASCAR website). 
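
The 2011 point rules described above can be sketched as a small Python function. This is a minimal illustration of the rules as stated in this case, not an official NASCAR calculation; the function name `race_points` and its parameters are hypothetical.

```python
def race_points(position, led_a_lap=False, led_most_laps=False):
    """Points for one race under the 2011 rules described in the case.

    position: finishing position, 1 through 43
    led_a_lap: True if the driver led at least one lap
    led_most_laps: True if the driver led the most laps
    """
    if not 1 <= position <= 43:
        raise ValueError("position must be between 1 and 43")
    points = 44 - position        # 43 for first, 42 for second, ..., 1 for 43rd
    if position == 1:
        points += 3               # bonus for winning the race
    if led_a_lap:
        points += 1               # bonus for leading any lap
    if led_most_laps:
        points += 1               # bonus for leading the most laps
    return min(points, 48)        # no driver can earn more than 48 in a race


# A winner who also led the most laps reaches the 48-point maximum
best_case = race_points(1, led_a_lap=True, led_most_laps=True)
print(best_case)
```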


TABLE 15.8 NASCAR Results for the 2011 Season

Driver                Points   Poles   Wins   Top 5   Top 10        Winnings ($)
Tony Stewart          2403     1       5      9       19            6,529,870
Carl Edwards          2403     3       1      19      26            8,485,990
Kevin Harvick         2345     0       4      9       19            6,197,140
Matt Kenseth          2330     3       3      12      20            6,183,580
Brad Keselowski       2319     1       3      10      14            5,087,740
Jimmie Johnson        2304     0       2      14      21            6,296,360
Dale Earnhardt Jr.    2290     1       0      4       12            4,163,690
Jeff Gordon           2287     1       3      13      18            5,912,830
Denny Hamlin          2284     0       1      5       14            5,401,190
Ryan Newman           2284     3       1      9       17            5,303,020
Kurt Busch            2262     3       2      8       16            5,936,470
Kyle Busch            2246     1       4      14      18            6,161,020
Clint Bowyer          1047     0       1      4       16            5,633,950
Kasey Kahne           1041     2       1      8       15            4,775,160
A. J. Allmendinger    1013     0       0      1       10            4,825,560
Greg Biffle           997      3       0      3       10            4,318,050
Paul Menard           947      0       1      4       8             3,853,690
Martin Truex Jr.      937      1       0      3       12            3,955,560
Marcos Ambrose        936      0       1      5       12            4,750,390
Jeff Burton           935      0       0      2       5             3,807,780
Juan Montoya          932      2       0      2       8             5,020,780
Mark Martin           930      2       0      2       10            3,830,910
David Ragan           906      2       1      4       8             4,203,660
Joey Logano           902      2       0      4       6             3,856,010
Brian Vickers         846      0       0      8       [illegible]   4,301,880
Regan Smith           820      0       1      2       5             4,579,860
Jamie McMurray        795      1       0      2       4             4,794,770
David Reutimann       757      1       0      1       3             4,374,770
Bobby Labonte         670      0       0      1       2             4,505,650
David Gilliland       572      0       0      1       2             3,878,390
Casey Mears           541      0       0      0       0             2,838,320
Dave Blaney           508      0       0      1       1             3,229,210
Andy Lally            398      0       0      0       0             2,868,220
Robby Gordon          268      0       0      0       0             2,271,890
J. J. Yeley           192      0       0      0       0             2,559,500


Managerial Report

1. Suppose you wanted to predict Winnings ($) using only the number of poles won
(Poles), the number of wins (Wins), the number of top five finishes (Top 5), or the
number of top ten finishes (Top 10). Which of these four variables provides the best single
predictor of winnings?

2. Develop an estimated regression equation that can be used to predict Winnings ($) given
the number of poles won (Poles), the number of wins (Wins), the number of top five
finishes (Top 5), and the number of top ten (Top 10) finishes. Test for individual
significance and discuss your findings and conclusions.

3. Create two new independent variables: Top 2-5 and Top 6-10. Top 2-5 represents 
the number of times the driver finished between second and fifth place and Top 6-10
represents the number of times the driver finished between sixth and tenth place. 
Develop an estimated regression equation that can be used to predict Winnings ($) using 
Poles, Wins, Top 2-5, and Top 6-10. Test for individual significance and discuss your 
findings and conclusions. 

4. Based upon the results of your analysis, what estimated regression equation would you 
recommend using to predict Winnings ($)? Provide an interpretation of the estimated 
regression coefficients for this equation. 
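
The two new variables in item 3 can be derived from the columns already in Table 15.8. The sketch below assumes, as the counts in the table suggest but the case does not state outright, that Top 5 includes wins and that Top 10 includes top five finishes; the function name `split_top_finishes` is hypothetical.

```python
def split_top_finishes(wins, top5, top10):
    """Derive Top 2-5 and Top 6-10 from the Wins, Top 5, and Top 10 counts.

    Assumption: Top 5 counts include wins, and Top 10 counts include
    top five finishes, so the new variables are simple differences.
    """
    top_2_5 = top5 - wins      # finishes in second through fifth place
    top_6_10 = top10 - top5    # finishes in sixth through tenth place
    return top_2_5, top_6_10


# Matt Kenseth's row in Table 15.8: 3 wins, 12 top five, 20 top ten
kenseth = split_top_finishes(wins=3, top5=12, top10=20)
print(kenseth)
```

Splitting the counts this way removes the overlap among Wins, Top 5, and Top 10, which is one reason the regression in item 3 may behave differently from the one in item 2.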




CASE PROBLEM 3: FINDING THE BEST CAR VALUE
When trying to decide what car to buy, real value is not necessarily determined by how 
much you spend on the initial purchase. Instead, cars that are reliable and don’t cost much 
to own often represent the best values. But no matter how reliable or inexpensive a car may 
be to own, it must also perform well. 

To measure value, Consumer Reports developed a statistic referred to as a value 
score. The value score is based upon five-year owner costs, overall road-test scores, 
and predicted-reliability ratings. Five-year owner costs are based upon the expenses 
incurred in the first five years of ownership, including depreciation, fuel, mainte- 
nance and repairs, and so on. Using a national average of 12,000 miles per year, 
an average cost per mile driven is used as the measure of five-year owner costs. 
Road-test scores are the results of more than 50 tests and evaluations and are based 
on a 100-point scale, with higher scores indicating better performance, comfort, 
convenience, and fuel economy. The highest road-test score obtained in the tests 
conducted by Consumer Reports was a 99 for a Lexus LS 460L. Predicted-reliability 
ratings (1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, and 5 = Excellent) are 
based upon data from Consumer Reports’ Annual Auto Survey. 

A car with a value score of 1.0 is considered to be an "average-value" car. A car with
a value score of 2.0 is considered to be twice as good a value as a car with a value score
of 1.0; a car with a value score of 0.5 is considered half as good as average; and so on.
The data for three sizes of cars (13 small sedans, 20 family sedans, and 21 upscale
sedans), including the price ($) of each car tested, are contained in the file CarValues
(Consumer Reports website). To incorporate the effect of size of car, a categorical
variable with three values (small sedan, family sedan, and upscale sedan), use the
following dummy variables:


Family-Sedan = 1 if the car is a Family Sedan
               0 otherwise

Upscale-Sedan = 1 if the car is an Upscale Sedan
                0 otherwise
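
The dummy-variable coding above can be sketched in Python as follows. The function name `size_dummies` and the category labels used as strings are assumptions made for illustration; the file CarValues may label the sizes differently.

```python
def size_dummies(size):
    """Code a three-level car size as two dummy variables.

    Small Sedan is the baseline category: both dummies equal 0.
    """
    valid_sizes = {"Small Sedan", "Family Sedan", "Upscale Sedan"}
    if size not in valid_sizes:
        raise ValueError(f"unknown car size: {size!r}")
    return {
        "Family-Sedan": int(size == "Family Sedan"),
        "Upscale-Sedan": int(size == "Upscale Sedan"),
    }


for car_size in ("Small Sedan", "Family Sedan", "Upscale Sedan"):
    print(car_size, size_dummies(car_size))
```

Only two dummy variables are needed for three categories; the coefficients on Family-Sedan and Upscale-Sedan are then interpreted relative to the Small Sedan baseline.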

Managerial Report 

1. Treating Cost/Mile as the dependent variable, develop an estimated regression with 
Family-Sedan and Upscale-Sedan as the independent variables. Discuss your findings. 

2. Treating Value Score as the dependent variable, develop an estimated regression 
equation using Cost/Mile, Road-Test Score, Predicted Reliability, Family-Sedan, and 
Upscale-Sedan as the independent variables. 

3. Delete any independent variables that are not significant from the estimated regression 
equation developed in part 2 using a .05 level of significance. After deleting any inde- 
pendent variables that are not significant, develop a new estimated regression equation. 

4. Suppose someone claims that “smaller cars provide better values than larger cars.” For 
the data in this case, the Small Sedans represent the smallest type of car and the Upscale 
Sedans represent the largest type of car. Does your analysis support this claim? 
Appendix B Tables


TABLE 1 Cumulative Probabilities for the Standard Normal Distribution

Entries in the table give the area under the curve to the left of the z value.
For example, for z = −.85, the cumulative probability is .1977.

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
−3.0   .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010

−2.9   .0019  .0018  .0018  .0017  .0016  .0016  .0015  .0015  .0014  .0014
−2.8   .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
−2.7   .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
−2.6   .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
−2.5   .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048

−2.4   .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
−2.3   .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
−2.2   .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
−2.1   .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
−2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183

−1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
−1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
−1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
−1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
−1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559

−1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
−1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
−1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
−1.1   .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
−1.0   .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379

−.9    .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
−.8    .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
−.7    .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
−.6    .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
−.5    .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776

−.4    .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
−.3    .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
−.2    .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
−.1    .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
−.0    .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
TABLE 1 Cumulative Probabilities for the Standard Normal Distribution (Continued)

Entries in the table give the area under the curve to the left of the z value.
For example, for z = 1.25, the cumulative probability is .8944.

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
.0     .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
.1     .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
.2     .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
.3     .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
.4     .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879

.5     .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
.6     .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
.7     .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
.8     .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
.9     .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389

1.0    .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1    .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2    .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3    .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4    .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319

1.5    .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6    .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7    .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8    .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9    .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767

2.0    .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1    .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2    .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3    .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4    .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936

2.5    .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6    .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7    .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8    .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9    .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986

3.0    .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
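
Entries in Table 1 can be checked, or computed for z values the table does not list, with the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `cumulative_probability` is hypothetical, and rounding to four places is an assumption made to match the table's precision.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0.0, sigma=1.0)


def cumulative_probability(z):
    """Area under the standard normal curve to the left of z, to four places."""
    return round(std_normal.cdf(z), 4)


# The two examples quoted with Table 1
left_of_minus_085 = cumulative_probability(-0.85)
left_of_125 = cumulative_probability(1.25)
print(left_of_minus_085, left_of_125)
```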
TABLE 2 t Distribution

Entries in the table give t values for an area or probability in the upper tail of the t
distribution. For example, with 10 degrees of freedom and a .05 area in the upper tail,
$t_{.05} = 1.812$.

                       Area in Upper Tail
Degrees
of Freedom   .20     .10     .05     .025     .01      .005
1            1.376   3.078   6.314   12.706   31.821   63.656
2            1.061   1.886   2.920   4.303    6.965    9.925
3            .978    1.638   2.353   3.182    4.541    5.841
4            .941    1.533   2.132   2.776    3.747    4.604
5            .920    1.476   2.015   2.571    3.365    4.032
6            .906    1.440   1.943   2.447    3.143    3.707
7            .896    1.415   1.895   2.365    2.998    3.499
8            .889    1.397   1.860   2.306    2.896    3.355
9            .883    1.383   1.833   2.262    2.821    3.250
10           .879    1.372   1.812   2.228    2.764    3.169
11           .876    1.363   1.796   2.201    2.718    3.106
12           .873    1.356   1.782   2.179    2.681    3.055
13           .870    1.350   1.771   2.160    2.650    3.012
14           .868    1.345   1.761   2.145    2.624    2.977
15           .866    1.341   1.753   2.131    2.602    2.947
16           .865    1.337   1.746   2.120    2.583    2.921
17           .863    1.333   1.740   2.110    2.567    2.898
18           .862    1.330   1.734   2.101    2.552    2.878
19           .861    1.328   1.729   2.093    2.539    2.861
20           .860    1.325   1.725   2.086    2.528    2.845
21           .859    1.323   1.721   2.080    2.518    2.831
22           .858    1.321   1.717   2.074    2.508    2.819
23           .858    1.319   1.714   2.069    2.500    2.807
24           .857    1.318   1.711   2.064    2.492    2.797
25           .856    1.316   1.708   2.060    2.485    2.787
26           .856    1.315   1.706   2.056    2.479    2.779
27           .855    1.314   1.703   2.052    2.473    2.771
28           .855    1.313   1.701   2.048    2.467    2.763
29           .854    1.311   1.699   2.045    2.462    2.756
30           .854    1.310   1.697   2.042    2.457    2.750
31           .853    1.309   1.696   2.040    2.453    2.744
32           .853    1.309   1.694   2.037    2.449    2.738
33           .853    1.308   1.692   2.035    2.445    2.733
34           .852    1.307   1.691   2.032    2.441    2.728
TABLE 2 t Distribution (Continued)

                       Area in Upper Tail
Degrees
of Freedom   .20    .10     .05     .025        .01     .005
35           .852   1.306   1.690   2.030       2.438   2.724
36           .852   1.306   1.688   2.028       2.434   2.719
37           .851   1.305   1.687   2.026       2.431   2.715
38           .851   1.304   1.686   2.024       2.429   2.712
39           .851   1.304   1.685   2.023       2.426   2.708

40           .851   1.303   1.684   2.021       2.423   2.704
41           .850   1.303   1.683   2.020       2.421   2.701
42           .850   1.302   1.682   2.018       2.418   2.698
43           .850   1.302   1.681   2.017       2.416   2.695
44           .850   1.301   1.680   2.015       2.414   2.692

45           .850   1.301   1.679   2.014       2.412   2.690
46           .850   1.300   1.679   2.013       2.410   2.687
47           .849   1.300   1.678   2.012       2.408   2.685
48           .849   1.299   1.677   2.011       2.407   2.682
49           .849   1.299   1.677   2.010       2.405   2.680

50           .849   1.299   1.676   2.009       2.403   2.678
51           .849   1.298   1.675   2.008       2.402   2.676
52           .849   1.298   1.675   2.007       2.400   2.674
53           .848   1.298   1.674   2.006       2.399   2.672
54           .848   1.297   1.674   2.005       2.397   2.670

55           .848   1.297   1.673   2.004       2.396   2.668
56           .848   1.297   1.673   2.003       2.395   2.667
57           .848   1.297   1.672   2.002       2.394   2.665
58           .848   1.296   1.672   2.002       2.392   2.663
59           .848   1.296   1.671   2.001       2.391   2.662

60           .848   1.296   1.671   [missing]   2.390   2.660
61           .848   1.296   1.670   [missing]   2.389   2.659
62           .847   1.295   1.670   [missing]   2.388   2.657
63           .847   1.295   1.669   [missing]   2.387   2.656
64           .847   1.295   1.669   [missing]   2.386   2.655

65           .847   1.295   1.669   [missing]   2.385   2.654
66           .847   1.295   1.668   [missing]   2.384   2.652
67           .847   1.294   1.668   [missing]   2.383   2.651
68           .847   1.294   1.668   [missing]   2.382   2.650
69           .847   1.294   1.667   [missing]   2.382   2.649

70           .847   1.294   1.667   [missing]   2.381   2.648
71           .847   1.294   1.667   [missing]   2.380   2.647
72           .847   1.293   1.666   [missing]   2.379   2.646
73           .847   1.293   1.666   [missing]   2.379   2.645
74           .847   1.293   1.666   [missing]   2.378   2.644

75           .846   1.293   1.665   [missing]   2.377   2.643
76           .846   1.293   1.665   [missing]   2.376   2.642
77           .846   1.293   1.665   [missing]   2.376   2.641
78           .846   1.292   1.665   [missing]   2.375   2.640
79           .846   1.292   1.664   [missing]   2.374   2.639
TABLE 2 t Distribution (Continued)

                       Area in Upper Tail
Degrees
of Freedom   .20    .10     .05     .025    .01     .005
80           .846   1.292   1.664   1.990   2.374   2.639
81           .846   1.292   1.664   1.990   2.373   2.638
82           .846   1.292   1.664   1.989   2.373   2.637
83           .846   1.292   1.663   1.989   2.372   2.636
84           .846   1.292   1.663   1.989   2.372   2.636
85           .846   1.292   1.663   1.988   2.371   2.635
86           .846   1.291   1.663   1.988   2.370   2.634
87           .846   1.291   1.663   1.988   2.370   2.634
88           .846   1.291   1.662   1.987   2.369   2.633
89           .846   1.291   1.662   1.987   2.369   2.632
90           .846   1.291   1.662   1.987   2.368   2.632
91           .846   1.291   1.662   1.986   2.368   2.631
92           .846   1.291   1.662   1.986   2.368   2.630
93           .846   1.291   1.661   1.986   2.367   2.630
94           .845   1.291   1.661   1.986   2.367   2.629
95           .845   1.291   1.661   1.985   2.366   2.629
96           .845   1.290   1.661   1.985   2.366   2.628
97           .845   1.290   1.661   1.985   2.365   2.627
98           .845   1.290   1.661   1.984   2.365   2.627
99           .845   1.290   1.660   1.984   2.364   2.626
100          .845   1.290   1.660   1.984   2.364   2.626
∞            .842   1.282   1.645   1.960   2.326   2.576
TABLE 3 Chi-Square Distribution

Entries in the table give $\chi^2_\alpha$ values, where $\alpha$ is the area or probability in the upper tail of the chi-square distribution.
For example, with 10 degrees of freedom and a .01 area in the upper tail, $\chi^2_{.01} = 23.209$.

                                         Area in Upper Tail
Degrees
of Freedom   .995     .99      .975     .95      .90      .10      .05      .025     .01      .005
1            .000     .000     .001     .004     .016     2.706    3.841    5.024    6.635    7.879
2            .010     .020     .051     .103     .211     4.605    5.991    7.378    9.210    10.597
3            .072     .115     .216     .352     .584     6.251    7.815    9.348    11.345   12.838
4            .207     .297     .484     .711     1.064    7.779    9.488    11.143   13.277   14.860
5            .412     .554     .831     1.145    1.610    9.236    11.070   12.832   15.086   16.750
6            .676     .872     1.237    1.635    2.204    10.645   12.592   14.449   16.812   18.548
7            .989     1.239    1.690    2.167    2.833    12.017   14.067   16.013   18.475   20.278
8            1.344    1.647    2.180    2.733    3.490    13.362   15.507   17.535   20.090   21.955
9            1.735    2.088    2.700    3.325    4.168    14.684   16.919   19.023   21.666   23.589
10           2.156    2.558    3.247    3.940    4.865    15.987   18.307   20.483   23.209   25.188
11           2.603    3.053    3.816    4.575    5.578    17.275   19.675   21.920   24.725   26.757
12           3.074    3.571    4.404    5.226    6.304    18.549   21.026   23.337   26.217   28.300
13           3.565    4.107    5.009    5.892    7.041    19.812   22.362   24.736   27.688   29.819
14           4.075    4.660    5.629    6.571    7.790    21.064   23.685   26.119   29.141   31.319
15           4.601    5.229    6.262    7.261    8.547    22.307   24.996   27.488   30.578   32.801
16           5.142    5.812    6.908    7.962    9.312    23.542   26.296   28.845   32.000   34.267
17           5.697    6.408    7.564    8.672    10.085   24.769   27.587   30.191   33.409   35.718
18           6.265    7.015    8.231    9.390    10.865   25.989   28.869   31.526   34.805   37.156
19           6.844    7.633    8.907    10.117   11.651   27.204   30.144   32.852   36.191   38.582
20           7.434    8.260    9.591    10.851   12.443   28.412   31.410   34.170   37.566   39.997
21           8.034    8.897    10.283   11.591   13.240   29.615   32.671   35.479   38.932   41.401
22           8.643    9.542    10.982   12.338   14.041   30.813   33.924   36.781   40.289   42.796
23           9.260    10.196   11.689   13.091   14.848   32.007   35.172   38.076   41.638   44.181
24           9.886    10.856   12.401   13.848   15.659   33.196   36.415   39.364   42.980   45.558
25           10.520   11.524   13.120   14.611   16.473   34.382   37.652   40.646   44.314   46.928
26           11.160   12.198   13.844   15.379   17.292   35.563   38.885   41.923   45.642   48.290
27           11.808   12.878   14.573   16.151   18.114   36.741   40.113   43.195   46.963   49.645
28           12.461   13.565   15.308   16.928   18.939   37.916   41.337   44.461   48.278   50.994
29           13.121   14.256   16.047   17.708   19.768   39.087   42.557   45.722   49.588   52.335
TABLE 3 Chi-Square Distribution (Continued)

                                         Area in Upper Tail
Degrees
of Freedom   .995     .99      .975     .95      .90      .10       .05       .025      .01       .005
30           13.787   14.953   16.791   18.493   20.599   40.256    43.773    46.979    50.892    53.672
35           17.192   18.509   20.569   22.465   24.797   46.059    49.802    53.203    57.342    60.275
40           20.707   22.164   24.433   26.509   29.051   51.805    55.758    59.342    63.691    66.766
45           24.311   25.901   28.366   30.612   33.350   57.505    61.656    65.410    69.957    73.166
50           27.991   29.707   32.357   34.764   37.689   63.167    67.505    71.420    76.154    79.490
55           31.735   33.571   36.398   38.958   42.060   68.796    73.311    77.380    82.292    85.749
60           35.534   37.485   40.482   43.188   46.459   74.397    79.082    83.298    88.379    91.952
65           39.383   41.444   44.603   47.450   50.883   79.973    84.821    89.177    94.422    98.105
70           43.275   45.442   48.758   51.739   55.329   85.527    90.531    95.023    100.425   104.215
75           47.206   49.475   52.942   56.054   59.795   91.061    96.217    100.839   106.393   110.285
80           51.172   53.540   57.153   60.391   64.278   96.578    101.879   106.629   112.329   116.321
85           55.170   57.634   61.389   64.749   68.777   102.079   107.522   112.393   118.236   122.324
90           59.196   61.754   65.647   69.126   73.291   107.565   113.145   118.136   124.116   128.299
95           63.250   65.898   69.925   73.520   77.818   113.038   118.752   123.858   129.973   134.247
100          67.328   70.065   74.222   77.929   82.358   118.498   124.342   129.561   135.807   140.170
Summations 
Definition 

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \qquad \text{(C.1)}$$

Example for $x_1 = 5$, $x_2 = 8$, $x_3 = 14$: 

$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 5 + 8 + 14 = 27$$
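Result 1 
For a constant c: 

$$\sum_{i=1}^{n} c = \underbrace{c + c + \cdots + c}_{n \text{ times}} = nc \qquad \text{(C.2)}$$

Example for $c = 5$, $n = 10$: 

$$\sum_{i=1}^{10} 5 = 10(5) = 50$$

Example for $c = \bar{x}$: 

$$\sum_{i=1}^{n} \bar{x} = n\bar{x}$$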


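Result 2 

$$\sum_{i=1}^{n} c x_i = c x_1 + c x_2 + \cdots + c x_n = c(x_1 + x_2 + \cdots + x_n) = c \sum_{i=1}^{n} x_i \qquad \text{(C.3)}$$

Example for $x_1 = 5$, $x_2 = 8$, $x_3 = 14$, $c = 2$: 

$$\sum_{i=1}^{3} 2x_i = 2 \sum_{i=1}^{3} x_i = 2(27) = 54$$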
Result 3 

$$\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i \qquad \text{(C.4)}$$

Example for $x_1 = 5$, $x_2 = 8$, $x_3 = 14$, $a = 2$, $y_1 = 7$, $y_2 = 3$, $y_3 = 8$, $b = 4$: 

$$\sum_{i=1}^{3} (2x_i + 4y_i) = 2 \sum_{i=1}^{3} x_i + 4 \sum_{i=1}^{3} y_i = 2(27) + 4(18) = 54 + 72 = 126$$
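
The three results can also be checked numerically. The short Python sketch below reuses the example values from this appendix (x = 5, 8, 14; y = 7, 3, 8; c = 2 or 5; a = 2; b = 4); the variable names are illustrative only and are not part of the text's notation.

```python
# Example data from the text
x = [5, 8, 14]
y = [7, 3, 8]

# Definition (C.1): the sum of the x values
sum_x = sum(x)                      # 5 + 8 + 14
sum_y = sum(y)                      # 7 + 3 + 8

# Result 1 (C.2): summing a constant c over n terms gives n * c
c, n = 5, 10
result_1 = sum(c for _ in range(n))

# Result 2 (C.3): a constant factor can be moved outside the summation
result_2_left = sum(2 * xi for xi in x)
result_2_right = 2 * sum_x

# Result 3 (C.4): the summation of a linear combination splits term by term
a, b = 2, 4
result_3_left = sum(a * xi + b * yi for xi, yi in zip(x, y))
result_3_right = a * sum_x + b * sum_y

if __name__ == "__main__":
    print(sum_x, result_1, result_2_left, result_3_left)
```

Each left-hand side evaluates the summation term by term, while each right-hand side applies the corresponding result, so the paired values should agree.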


Double Summations 


Consider the following data involving the variable x, where i is the subscript denoting the 
row position and j is the subscript denoting the column position: 


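                    Column 
          1               2               3 
Row 1     $x_{11} = 10$   $x_{12} = 8$    $x_{13} = 6$ 
Row 2     $x_{21} = 7$    $x_{22} = 4$    $x_{23} = 12$ 

Definition 

$$\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} = (x_{11} + x_{12} + \cdots + x_{1m}) + (x_{21} + x_{22} + \cdots + x_{2m}) + (x_{31} + x_{32} + \cdots + x_{3m}) + \cdots + (x_{n1} + x_{n2} + \cdots + x_{nm}) \qquad \text{(C.5)}$$

Example: 

$$\sum_{i=1}^{2} \sum_{j=1}^{3} x_{ij} = x_{11} + x_{12} + x_{13} + x_{21} + x_{22} + x_{23} = 10 + 8 + 6 + 7 + 4 + 12 = 47$$

Definition 

$$\sum_{i=1}^{n} x_{ij} = x_{1j} + x_{2j} + \cdots + x_{nj} \qquad \text{(C.6)}$$

Example: 

$$\sum_{i=1}^{2} x_{i2} = x_{12} + x_{22} = 8 + 4 = 12$$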
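
As a small illustration of the two definitions, the Python sketch below stores the 2 x 3 data set as a list of rows (values 10, 8, 6 in row 1 and 7, 4, 12 in row 2) and evaluates the double summation and the sum down column 2. Note that Python indexes from 0, so column j = 2 in the text's notation is index 1 in the code; the variable names are illustrative.

```python
# Data from the text: x[i][j] holds the value in row i+1, column j+1
x = [
    [10, 8, 6],   # row 1: x_11, x_12, x_13
    [7, 4, 12],   # row 2: x_21, x_22, x_23
]

n_rows = len(x)       # n = 2
n_cols = len(x[0])    # m = 3

# Double summation (C.5): add every x_ij, row by row
double_sum = sum(x[i][j] for i in range(n_rows) for j in range(n_cols))

# Single summation over i for a fixed column j (C.6), here j = 2
j = 2
column_sum = sum(x[i][j - 1] for i in range(n_rows))

if __name__ == "__main__":
    print(double_sum, column_sum)
```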


Shorthand Notation 


Sometimes when a summation is for all values of the subscript, we use the following 
shorthand notations: 


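$$\sum_{i=1}^{n} x_i = \sum x_i \qquad \text{(C.7)}$$

$$\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} = \sum\sum x_{ij} \qquad \text{(C.8)}$$

$$\sum_{i=1}^{n} x_{ij} = \sum_{i} x_{ij} \qquad \text{(C.9)}$$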
Appendix E 


Microsoft Excel is a spreadsheet program that can be used to organize and analyze 
data, perform complex calculations, and create a wide variety of graphical displays. 
We assume that readers are familiar with basic Excel operations such as selecting cells, 
entering formulas, and copying. We do not assume that readers are familiar with 
the use of Excel for statistical analysis. 

The purpose of this appendix is twofold. First, we provide an overview of Excel 
and discuss the basic operations needed to work with Excel workbooks and worksheets. 
Second, we provide an overview of the tools that are available for conducting statistical 
analysis with Excel. These include Excel functions and formulas, which allow users to 
conduct their own analyses, and add-ins, which provide more comprehensive analysis tools. 

Excel’s Data Analysis add-in, included with the basic Excel system, is a valuable tool for 
conducting statistical analysis. In the last section of this appendix we provide instruction for 
installing the Data Analysis add-in. 


Overview of Microsoft Excel 


When using Excel for statistical analysis, data is displayed in workbooks, each of which 
contains a series of worksheets that typically include the original data as well as any resulting 
analysis, including charts. Figure E.1 shows the layout of a blank workbook created each 
time Excel is opened. The workbook is named Book1 and contains one worksheet named 
Sheet1. Excel highlights the worksheet currently displayed (Sheet1) by setting the name on 
the worksheet tab in bold. Note that cell A1 is initially selected. 

A workbook is a file containing one or more worksheets. 


The wide bar located across the top of the workbook is referred to as the Ribbon. Tabs, 
located at the top of the Ribbon, provide quick access to groups of related commands. There 
are eight tabs shown on the workbook in Figure E.1: File; Home; Insert; Page Layout; Formu- 
las; Data; Review; and View. Each tab contains a series of groups of related commands. Note 
that the Home tab is selected when Excel is opened. Figure E.2 displays the groups available 
when the Home tab is selected. Under the Home tab there are seven groups: Clipboard; Font; 
Alignment; Number; Styles; Cells; and Editing. Commands are arranged within each group. 
For example, to change selected text to boldface, click the Home tab and click the Bold (B) 
button in the Font group. 

Figure E.3 illustrates the location of the Quick Access Toolbar and the Formula Bar. The 
Quick Access Toolbar allows you to quickly access workbook options. To add or remove 
features on the Quick Access Toolbar, click the Customize Quick Access Toolbar button 
at the end of the Quick Access Toolbar. 

The Formula Bar (see Figure E.3) contains a Name box, the Insert Function button 
(fx), and a Formula box. In Figure E.3, “A1” appears in the Name box because cell A1 
is selected. You can select any other cell in the worksheet by using the mouse to move 
the cursor to another cell and clicking or by typing the new cell location in the Name 
box. The Formula box is used to display the formula in the currently selected cell. For 
instance, if you enter =A1+A2 into cell A3, whenever you select cell A3 the formula 
=A1+A2 will be shown in the Formula box. This feature makes it very easy to see and 
edit a formula in a particular cell. The Insert Function button allows you to quickly ac- 
cess all the functions available in Excel. Later we show how to find and use a particular 
function. 
FIGURE E.1 Blank Workbook Created When Excel Is Opened 

Basic Workbook Operations 


Figure E.4 illustrates the worksheet options that can be performed after right-clicking on a 
worksheet tab. For instance, to change the name of the current worksheet from “Sheet1” to 
“Data,” right-click the worksheet tab named “Sheet1” and select the Rename option. The 
current worksheet name (Sheet1) will be highlighted. Then, simply type the new name (Data) 
and press the Enter key to rename the worksheet. 


FIGURE E.2 Portion of the Home Tab 
FIGURE E.3 Excel Quick Access Toolbar and Formula Bar 

Suppose that you want to create a copy of “Sheet1.” After right-clicking the tab named 
“Sheet1,” select the Move or Copy option. When the Move or Copy dialog box appears, 
select Create a Copy and click OK. The name of the copied worksheet will appear as “Sheet1 
(2).” You can then rename it, if desired. 

To add a new worksheet to the workbook, right-click any worksheet tab and select the 
Insert option; when the Insert dialog box appears, select Worksheet and click OK. An addi- 
tional blank worksheet will appear in the workbook. You can also insert a new worksheet by 
clicking the New sheet button (+) that appears to the right of the last worksheet tab displayed. 
Worksheets can be deleted by right-clicking the worksheet tab and choosing Delete. Work- 
sheets can also be moved to other workbooks or a different position in the current workbook 
by using the Move or Copy option. 

FIGURE E.4 Worksheet Options Obtained after Right-Clicking on a Worksheet Tab 


Creating, Saving, and Opening Files 


Data can be entered into an Excel worksheet by manually entering the data into the work- 
sheet or by opening another workbook that already contains the data. As an illustration 
of manually entering, saving, and opening a file we will use the example from Chapter 2 
involving data for a sample of 50 soft drink purchases. The original data are shown in 
Table E.1. 

Suppose we want to enter the data for the sample of 50 soft drink purchases into Sheet1 
of the new workbook. First we enter the label “Brand Purchased” into cell A1; then we 
enter the data for the 50 soft drink purchases into cells A2:A51. As a reminder that this 
worksheet contains the data, we will change the name of the worksheet from “Sheet1” to 
“Data” using the procedure described previously. Figure E.5 shows the data worksheet 
that we just developed. 

Before doing any analysis with these data, we recommend that you first save the file; this 
will prevent you from having to reenter the data in case something happens that causes Excel 
to close. To save the file as an Excel workbook using the filename SoftDrink we perform 
the following steps: 


Step 1: Click the File tab 
Step 2: Click Save in the list of options 
Step 3: When the Save As window appears: 
Select This PC 
Select Browse 
Select the location where you want to save the file 
Type the filename SoftDrink in the File name box 
Click Save 

TABLE E.1 Data from a Sample of 50 Soft Drink Purchases 
(DATA file: SoftDrink) 

Coca-Cola     Sprite        Pepsi 
Diet Coke     Coca-Cola     Coca-Cola 
Pepsi         Diet Coke     Coca-Cola 
Diet Coke     Coca-Cola     Coca-Cola 
Coca-Cola     Diet Coke     Pepsi 
Coca-Cola     Coca-Cola     Dr. Pepper 
Dr. Pepper    Sprite        Coca-Cola 
Diet Coke     Pepsi         Diet Coke 
Pepsi         Coca-Cola     Pepsi 
Pepsi         Coca-Cola     Pepsi 
Coca-Cola     Coca-Cola     Pepsi 
Dr. Pepper    Pepsi         Pepsi 
Sprite        Coca-Cola     Coca-Cola 
Coca-Cola     Sprite        Dr. Pepper 
Diet Coke     Dr. Pepper    Pepsi 
Coca-Cola     Pepsi         Sprite 
Coca-Cola     Diet Coke 
Keyboard shortcut: To save 
the file, press CTRL+S. 
[Figure E.5: the Data worksheet, with the label Brand Purchased in cell A1 and the 50 soft drink purchases in cells A2:A51] 

Note: Rows 11-49 are hidden. 


Excel’s Save command is designed to save the file as an Excel workbook. As you work with 
the file to do statistical analysis you should follow the practice of periodically saving the file 
so you will not lose any statistical analysis you may have performed. Simply click the File 
tab and select Save in the list of options. 

Sometimes you may want to create a copy of an existing file. For instance, suppose 
you would like to save the soft drink data and any resulting statistical analysis in a new file 
named “SoftDrink Analysis.” The following steps show how to create a copy of the SoftDrink 
workbook and analysis with the new filename, “SoftDrink Analysis.” 


Step 1: Click the File tab 
Step 2: Click Save As 
Step 3: When the Save As window appears: 
Select This PC 
Select Browse 
Select the location where you want to save the file 
Type the filename SoftDrink Analysis in the File name box 
Click Save 


Once the workbook has been saved, you can continue to work with the data to perform what- 
ever type of statistical analysis is appropriate. When you are finished working with the file, 
simply click the File tab and then click Close in the list of options. To access the SoftDrink 
Analysis file at another point in time, you can open the file by performing the following steps 
after launching Excel: 


Step 1: Click the File tab 
Step 2: Click Open 
Step 3: When the Open window appears: 
Select This PC 
Select Browse 
Select the location where you previously saved the file 
Enter the filename SoftDrink Analysis in the File name box 
Click Open 


The procedures we showed for saving or opening a workbook begin by clicking the File 
tab to access the Save and Open commands. Once you have used Excel for a while you will 
probably find it more convenient to add these commands to the Quick Access Toolbar. 


Using Excel Functions 


Excel provides a wealth of functions for data management and statistical analysis. If we 
know which function is needed, and how to use it, we can simply enter the function into 
the appropriate worksheet cell. However, if we are not sure which functions are available 
to accomplish a task, or are not sure how to use a particular function, Excel can provide 
assistance. Many new functions for statistical analysis have been added in recent versions of Excel. To 
illustrate we will use the SoftDrink Analysis workbook created in the previous subsection. 


Finding the Right Excel Function 


To identify the functions available in Excel, select the cell where you want to insert 
the function; we have selected cell D2. Click the Formulas tab on the Ribbon and then 
click the Insert Function button in the Function Library group. Alternatively, click the fx 
button on the formula bar. Either approach provides the Insert Function dialog box shown 
in Figure E.6. 

The Search for a function box at the top of the Insert Function dialog box enables 
us to type a brief description of what we want to do. After doing so and clicking Go, 
Excel will search for and display, in the Select a function box, the functions that may 
accomplish our task. In many situations, however, we may want to browse through 
an entire category of functions to see what is available. For this task, the Or select a 
category box is helpful. It contains a drop-down list of several categories of functions 
provided by Excel. Figure E.6 shows that we selected the Statistical category. As a 
result, Excel’s statistical functions appear in alphabetic order in the Select a function 
box. We see the AVEDEV function listed first, followed by the AVERAGE function, 
and so on. 

The AVEDEV function is highlighted in Figure E.6, indicating it is the function cur- 
rently selected. The proper syntax for the function and a brief description of the function 
appear below the Select a function box. We can scroll through the list in the Select a func- 
tion box to display the syntax and a brief description for each of the statistical functions 
that are available. For instance, scrolling down farther, we select the COUNTIF function 
as shown in Figure E.7. Note that COUNTIF is now highlighted, and that immediately 
below the Select a function box we see COUNTIF(range,criteria), which indicates that 
the COUNTIF function contains two inputs, range and criteria. In addition, we see that the 
description of the COUNTIF function is “Counts the number of cells within a range that 
meet the given condition.” 
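
To make the two inputs concrete, the Python sketch below mimics what COUNTIF(range,criteria) does for a simple text criterion, using the 50 soft drink purchases from the sample. It is only an illustration under stated assumptions: the helper `countif` is hypothetical, it handles exact text matches (ignoring letter case) and none of Excel's wildcard or comparison criteria, and the purchases are listed row by row from the data table, which does not affect the counts. In the worksheet itself, a formula such as =COUNTIF(A2:A51,"Coca-Cola") would play the same role.

```python
# The 50 soft drink purchases, listed row by row from the data table.
PURCHASES = [
    "Coca-Cola", "Sprite", "Pepsi",
    "Diet Coke", "Coca-Cola", "Coca-Cola",
    "Pepsi", "Diet Coke", "Coca-Cola",
    "Diet Coke", "Coca-Cola", "Coca-Cola",
    "Coca-Cola", "Diet Coke", "Pepsi",
    "Coca-Cola", "Coca-Cola", "Dr. Pepper",
    "Dr. Pepper", "Sprite", "Coca-Cola",
    "Diet Coke", "Pepsi", "Diet Coke",
    "Pepsi", "Coca-Cola", "Pepsi",
    "Pepsi", "Coca-Cola", "Pepsi",
    "Coca-Cola", "Coca-Cola", "Pepsi",
    "Dr. Pepper", "Pepsi", "Pepsi",
    "Sprite", "Coca-Cola", "Coca-Cola",
    "Coca-Cola", "Sprite", "Dr. Pepper",
    "Diet Coke", "Dr. Pepper", "Pepsi",
    "Coca-Cola", "Pepsi", "Sprite",
    "Coca-Cola", "Diet Coke",
]


def countif(cells, criterion):
    """Count the cells whose text equals the criterion, ignoring letter case."""
    target = criterion.lower()
    return sum(1 for cell in cells if cell.lower() == target)


if __name__ == "__main__":
    # One count per brand, as a frequency distribution would report
    for brand in ["Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"]:
        print(brand, countif(PURCHASES, brand))
```

The first argument corresponds to the range input (the cells to examine) and the second to the criteria input (the condition a cell must meet to be counted).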

If the function selected (highlighted) is the one we want to use, we click OK; the 
Function Arguments dialog box then appears. The Function Arguments dialog box for 
the COUNTIF function is shown in Figure E.8. This dialog box assists in creating the 
appropriate arguments (inputs) for the function selected. When finished entering the argu- 
ments, we click OK; Excel then inserts the function into a worksheet cell. 


FIGURE E.6 Insert Function Dialog Box 

Search for a function: Type a brief description of what you want to do and then click Go 
Or select a category: Statistical 
Select a function: AVEDEV, AVERAGE, AVERAGEA, AVERAGEIF, AVERAGEIFS, BETA.DIST, BETA.INV 

AVEDEV(number1,number2,...) 
Returns the average of the absolute deviations of data points from their 
mean. Arguments can be numbers or names, arrays, or references that 
contain numbers. 


FIGURE E.7 Description of the COUNTIF Function in the Insert Function Dialog Box 

Or select a category: Statistical 
Select a function: COUNT, COUNTA, COUNTBLANK, COUNTIF, COUNTIFS, COVARIANCE.P, COVARIANCE.S 

COUNTIF(range,criteria) 
Counts the number of cells within a range that meet the given condition. 


FIGURE E.8 Function Arguments Dialog Box for the COUNTIF Function 

COUNTIF 
Counts the number of cells within a range that meet the given condition. 
Range is the range of cells from which you want to count nonblank cells. 


Using Excel Add-Ins 
Excel’s Data Analysis Add-In 


Excel’s Data Analysis add-in, included with the basic Excel package, is a valuable tool 
for conducting statistical analysis. Before you can use the Data Analysis add-in it must be 
installed. To see if the Data Analysis add-in has already been installed, click the Data tab on 
the Ribbon. In the Analyze group you should see the Data Analysis command. If you do not 
have an Analyze group, or the Data Analysis command does not appear in the Analyze 
group, you will need to install the Data Analysis add-in. The steps needed to install the Data 
Analysis add-in are as follows: 


Step 1. Click the File tab 
Step 2. Click Options 
Step 3. When the Excel Options dialog box appears: 
Select Add-Ins from the list of options (on the pane on the left) 
In the Manage box, select Excel Add-Ins 
Click Go 
Step 4. When the Add-Ins dialog box appears: 
Select Analysis ToolPak 
Click OK 
A workbook is a file 
containing one or more 
worksheets. 


The ribbon in desktop 
Excel also includes the 
Page Layout and Formulas 
tabs. 


The Tables group in Excel 
Online is similar in many 
ways to the Styles group in 
desktop Excel. 


Excel Online's Clipboard 
tab contains several of the 
features in desktop Excel's 
Quick Access Toolbar. 


Microsoft Excel Online is a spreadsheet program that can be used to organize and analyze 
data, perform complex calculations, and create a wide variety of graphical displays. With 
Microsoft Excel Online, you can use your web browser to create, edit, view, and share 
workbooks stored on OneDrive or Dropbox. We assume that readers are familiar with 
basic Excel operations such as selecting cells, entering formulas, and copying. But we 
do not assume readers are familiar with Excel Online or the use of Excel for statistical 
analysis. 

The purpose of this appendix is to give a brief overview of Excel Online and contrast 
it with desktop versions of Excel. Excel Online is similar in many ways to desktop 
Excel, but there are also some important differences for those who are familiar with using 
desktop Excel. For instance, Excel’s Data Analysis add-in, a valuable tool for conducting 
statistical analysis that is included with desktop versions of Excel, is not available with 
Excel Online. Some add-ins that are useful for statistical analysis are available through 
the Office Store. To find a list of add-ins that are available for Excel Online, click on the 
Insert tab and select Office Add-ins in the Add-ins Group, then click on Office Store in 
the Office Add-ins box. 


When using Excel Online for statistical analysis, data is displayed in workbooks, each 
of which contains a series of worksheets that typically include the original data as well 
as any resulting analysis, including charts. Figure F.1 shows the layout of a blank work- 
book that can be opened in Excel Online. The workbook is named Book 1, and contains 
one worksheet named Sheet1. Excel Online highlights the worksheet currently displayed 
(Sheet1) by setting the name on the worksheet tab in bold. Note that cell A1 is initially 
selected. 

The wide bar located across the top of the workbook is referred to as the Ribbon. Tabs, 
located at the top of the Ribbon, provide quick access to groups of related commands. There 
are six tabs shown on the workbook in Figure F.1: File; Home; Insert; Data; Review; and 
View. Each tab contains a series of groups of related commands. Note that the Home tab is 
selected when Excel Online is opened. Figure F.2 displays the groups available when the 
Home tab is selected. Under the Home tab there are eight groups: Undo; Clipboard; Font; 
Alignment; Number; Tables; Cells; and Editing. Commands are arranged within each group. 
For example, to change selected text to boldface, click the Home tab and click the Bold (B) 
button in the Font group. 

Figure F.3 illustrates the location of the Formula Bar. 

The Formula Bar (see Figure F.3) contains a Name box, the Insert Function button (fx), and 
a Formula box. In Figure F.3, “A1” appears in the Name box because cell A1 is selected. 
You can select any other cell in the worksheet by using the mouse to move the cursor to 
another cell and clicking or by typing the new cell location in the Name box. The For- 
mula box is used to display the formula in the currently selected cell. For instance, if you 
enter =A1+A2 into cell A3, whenever you select cell A3 the formula =A1+A2 will be 
shown in the Formula box. This feature makes it very easy to see and edit a formula in 
a particular cell. The Insert Function button (fx) allows you to quickly access all the functions 
available in Excel Online. Later we show how to find and use a particular function. 
FIGURE F.1 Blank Workbook Created with Excel Online 
(Callouts in the figure: Ribbon; cell A1 selected) 


Basic Workbook Operations 


Figure F.4 illustrates the worksheet options that can be performed after right-clicking on 
a worksheet tab. For instance, to change the name of the current worksheet from “Sheet1” 
to “Data,” right-click the worksheet tab named “Sheet1” and select the Rename option. 
The current worksheet name (Sheet1) will be highlighted. Then, simply type the new name 
(Data) and press the Enter key to rename the worksheet. 

Suppose that you want to create a copy of “Sheet1.” After right-clicking the tab named 
“Sheet1,” select the Duplicate option. The name of the copied worksheet will appear as 
“Sheet1 (2).” You can then rename it, if desired. 


FIGURE F.2 Portion of the Home Tab 

FIGURE F.3 Excel Online Formula Bar 


To add a new worksheet to the workbook, right-click any worksheet tab and select 
the Insert option; when the Insert dialog box appears, select Worksheet and click OK. 
An additional blank worksheet will appear in the workbook. You can also insert a new 
worksheet by clicking the New sheet button that appears to the right of the last work- 
sheet tab displayed. Worksheets can be deleted by right-clicking the worksheet tab and 
choosing Delete. Worksheets can also be copied in the current workbook by using the 
Duplicate option. 


FIGURE F.4 Worksheet Options Obtained After Right-Clicking on a Worksheet Tab 

Menu options visible in the figure include Insert, Rename, Duplicate, and Tab Color. 

Excel Online’s Worksheet 
Options do not include the 
Move or Copy, View Code, 
Protect Sheet, or Select All 
Sheets options found in 
desktop Excel (sheets can 
still be moved by dragging 
them). 
OneDrive is Microsoft's 
storage service for hosting 
files in the cloud for an 
owner of a Microsoft 
account. 


Make sure that your 
Internet connection is 
active when you close the 
Excel Online file so all of 
your changes are saved in 
the OneDrive file. 


DATA file: SoftDrink 
Creating, Saving, Opening, and Uploading Files 


Data can be entered into an Excel Online worksheet by manually entering the data into the 
worksheet or by opening another workbook that already contains the data. As an illustration of 
manually entering, saving, and opening a file we will use the example from Chapter 2 involv- 
ing data for a sample of 50 soft drink purchases. The original data are shown in Table F.1. 

Suppose we want to enter the data for the sample of 50 soft drink purchases into Sheet1 of 
the new workbook. First we enter the label “Brand Purchased” into cell A1; then we enter the 
data for the 50 soft drink purchases into cells A2:A51. As a reminder that this worksheet con- 
tains the data, we will change the name of the worksheet from “Sheet1” to “Data” using the 
procedure described previously. Figure F.5 shows the data worksheet that we just developed. 

Excel Online does not have a Save button. Any changes you make to the Excel Online 
spreadsheet will automatically be saved to the existing file in OneDrive. However, you may 
wish to give the file a more descriptive name than what Excel Online has assigned by default. 
To change the name of the Excel Online file on OneDrive, we perform the following steps: 


Step 1: Click the File tab 

Step 2: Click Save As in the list of options 

Step 3: When the Save As window appears: 
Select Rename 
Type the filename SoftDrink in the Rename box 
Click OK 


The current Excel Online file is now saved on OneDrive under the file name SoftDrink. 

We may also want to save a copy of an Excel file to our computer’s hard drive. Excel 
Online’s Download a Copy command is designed to save an Excel Online file to your compu- 
ter’s hard drive as an Excel 2016 workbook. To save the Excel Online file to your computer’s 
hard drive as an Excel workbook using the filename SoftDrink, we perform the following steps: 


Step 1: Click the File tab 
Step 2: Click Save As in the list of options 
Step 3: When the Save As window appears, select Download a Copy 


TABLE F.1 Data from a Sample of 50 Soft Drink Purchases 

Coca-Cola     Sprite        Pepsi 
Diet Coke     Coca-Cola     Coca-Cola 
Pepsi         Diet Coke     Coca-Cola 
Diet Coke     Coca-Cola     Coca-Cola 
Coca-Cola     Diet Coke     Pepsi 
Coca-Cola     Coca-Cola     Dr. Pepper 
Dr. Pepper    Sprite        Coca-Cola 
Diet Coke     Pepsi         Diet Coke 
Pepsi         Coca-Cola     Pepsi 
Pepsi         Coca-Cola     Pepsi 
Coca-Cola     Coca-Cola     Pepsi 
Dr. Pepper    Pepsi         Pepsi 
Sprite        Coca-Cola     Coca-Cola 
Coca-Cola     Sprite        Dr. Pepper 
Diet Coke     Dr. Pepper    Pepsi 
Coca-Cola     Pepsi         Sprite 
Coca-Cola     Diet Coke 


[Figure F.5: the Data worksheet, with the label Brand Purchased in cell A1 and the 50 soft drink purchases in cells A2:A51] 

Note: Rows 11-49 are hidden. 


You may want to create a copy of an existing file. For instance, suppose you would 
like to save the soft drink data and any resulting statistical analysis in a new file named 
SoftDrinkAnalysis. The following steps show how to create a copy of the SoftDrink Excel 
Online workbook and analysis on OneDrive as a new Excel Online file with the new 
filename, SoftDrinkAnalysis. 


Step 1: Click the File tab 

Step 2: Click Save As 

Step 3: When the Save As window appears: 
Type the filename SoftDrinkAnalysis in the Name: box 
Click Save 


Once the workbook has been saved to OneDrive you can continue to work with the data to 
perform whatever type of statistical analysis is appropriate. 

This will open the file you selected in Excel Online in a new tab on your browser. 

One important feature of Excel Online is the ability for multiple users to open, view, 
and edit a worksheet simultaneously; this is called co-authoring. When you and other users 
co-author, you can see each other’s changes quickly, and this supports real-time collabor- 
ation. You may want to upload an Excel file from your computer to OneDrive in order to 
co-author with one or more other people on a worksheet. To upload an existing Excel file 
to OneDrive, sign in to your OneDrive account, open the Excel file you want to upload to 
OneDrive, and perform the following steps: 


Step 1: Click the File tab 

Step 2: Click Save As in the list of options 

Step 3: When the Save As window appears: 
Select OneDrive - Personal 
Select the OneDrive location to which you want to upload the file 
Click Save 
The current Excel file is now saved on OneDrive and can simultaneously be opened and 
edited by several users. 

Although Excel Online supports real-time collaboration, the manner in which worksheets 
can be edited changes when multiple users simultaneously have an Excel Online file open. When 
a workbook is simultaneously open by multiple users, the Undo and Redo commands become 
unavailable in one user’s browser window as soon as another user makes a change. We suggest 
saving a backup copy of an Excel Online file before multiple users simultaneously open and 
edit the file to protect against inadvertently losing important features of the original file. 


Using Excel Functions 


Excel Online provides many, but not all, of the functions for data management and statistical 
analysis provided in desktop versions of Excel. If we know which Excel Online function is 
needed, and how to use it, we can simply enter the function into the appropriate worksheet 
cell. However, if we are not sure which functions are available to accomplish a task, or are 
not sure how to use a particular function, Excel Online can provide assistance. To illustrate 
we will use the SoftDrinkAnalysis workbook created in the previous subsection. 

Almost all of the functions 
available through desktop 
Excel's Insert Function 
button are also available 
through Excel Online’s 
Insert Function button (fx). 


Finding the Right Excel Online Function 


To identify the functions available in Excel Online, select the cell in which you want to 
insert the function; we have selected cell D2. Click the fx button on the formula bar to access 
the Insert Function dialog box shown in Figure F.6. 


FIGURE F.6 Insert Function Dialog Box 


Insert Function 


Pick a category: | Statistical 


Pick a function: 


AVEDEV 
AVERAGE 
AVERAGEA 
AVERAGEIF 
AVERAGEIFS 
BETA.DIST 
BETA.INV 


AVEDEV(number1, [number2], ...) 


Returns the average of the absolute deviations of data 
points from their mean. Arguments can be numbers or 
names, arrays, or references that contain numbers 
Excel Online does not 
currently feature the 
Search for a function box 
in the Insert Function dialog box. 

We can scroll through the options in the Pick a function: box to find an available 
function that may accomplish our task. In many situations, however, we may want to 
browse through an entire category of functions to see what functions are available. 
For this task, the Pick a category: box is helpful. It contains a drop-down 
list of several categories of functions provided by Excel Online. Figure F.6 shows that we 
selected the Statistical category. As a result, Excel Online’s statistical functions 
appear in alphabetic order in the Pick a function: box. We see the AVEDEV 
function listed first, followed by the AVERAGE function, and so on. 

The AVEDEV function is highlighted in Figure F.6, indicating it is the function cur- 
rently selected. The proper syntax for the function and a brief description of the function ap- 
pear below the Pick a function: box. We can scroll through the list in the Pick a function: 
box to display the syntax and a brief description for each of the statistical functions that are 
available. For instance, scrolling down farther, we select the COUNTIF function as shown 
in Figure F.7. Note that COUNTIF is now highlighted, and that immediately below the Pick 
a function: box we see COUNTIF(range,criteria), which indicates that the COUNTIF 
function contains two inputs, range and criteria. In addition, we see that the description of 
the COUNTIF function is “Counts the number of cells within a range that meet the given 
condition.” 
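
The description shown for AVEDEV, "the average of the absolute deviations of data points from their mean," can be written out in a few lines. The Python sketch below is an illustration only: the helper name `avedev` and the sample values 5, 8, and 14 are assumptions chosen for the example, not data from a worksheet in this appendix.

```python
def avedev(values):
    """Return the average of the absolute deviations of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(v - mean) for v in values) / n


if __name__ == "__main__":
    sample = [5, 8, 14]            # hypothetical data; the mean is 9
    # Absolute deviations from the mean are 4, 1, and 5, so the result is 10/3
    print(round(avedev(sample), 4))
```

The single input corresponds to the number1, [number2], ... arguments in the syntax line: one or more numbers, or a range of cells that contains numbers.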


FIGURE F.7 Description of the COUNTIF Function in the Insert Function 
Dialog Box 


Insert Function 


Pick a category: | Statistical 


Pick a function: 


COUNT 
COUNTA 
COUNTBLANK 
COUNTIF 
COUNTIFS 
COVARIANCE.P 
COVARIANCE.S 


COUNTIF (range, criteria) 


Counts the number of cells within a range that meet the 
given condition 


Cancel 
FIGURE F.8 COUNTIF Function in Cell D2 of the SoftDrinkAnalysis Excel 


If the function selected (highlighted) is the one we want to use, we click OK; the 
function then appears in the cell we originally selected (D2); the function is open and 
ready for us to enter the inputs of the function. Underneath the selected cell, Excel Online 
provides a box that shows the names of the inputs for the functions and a hyperlink to a 
brief explanation of the function. This is shown in Figure F.8. 
A 


Accounting applications, 3 
Addition law, 188-190, 208 
Adjusted multiple coefficient of 
determination, 699, 724 
Air traffic controller stress test, 
576-577 
Alternative hypotheses, 398-401, 406
American Statistical Association, 
22-23 
Analysis of variance (ANOVA). See 
also ANOVA procedure; ANOVA 
tables 
assumptions for, 556 
between-treatments estimate of 
population variance, 559-560 
comparing variance estimates, F 
test, 561-562 
completely randomized design, 
558-569 
conceptual overview, 556-558 
example of, 552 
introduction to, 553-558 
output, interpretation of, 648-649
summary, 593-594 
testing for the equality of k 
population means, 564-566 
Analytics, 20 
ANOVA procedure 
in air traffic controller stress 
test, 576 
ANOVA table and, 577-578 
Excel used for, 579-581, 
589-591 
overview of, 585-586 
ANOVA tables 
ANOVA procedure and, 577-579 
example of, 589, 635-636, 704 
overview of, 562-566 
for two-factor factorial experiment, 
586 
Approximate class width, 92 
Area as a measure of probability, 
276-277 
normal distribution, 279-293 
standard normal distribution, 
281-285 
Association between two variables 
correlation coefficient, 148-149
covariance, 144-146 
covariance, interpretation of, 
146-148 
measures of, 144-153 
AT&T, 21 
AVERAGE function, 19 


B 


Bar charts 
example of, 87-88 
Excel used to create, 42-43 
histograms and, 59 
introduction to, 40-41 
side-by-side bar chart, 85-86 
Basic requirements for assigning 
probabilities, 178 
Bayes, Reverend Thomas, 203 


Bayes’ theorem 
definition of, 179 
formula, 208 
overview of, 201-204 
tabular approach, 204-205
Bell-shaped curve, 279
Bell-shaped distribution, 133-134
Bernoulli, Jakob, 242 
Bernoulli process, 242 
β (Parameters of regression model),
608-609, 630, 632 
Between-treatments estimate, 557 
Bias in selection, 313-314 
Big data, 21-22, 384-386 
confidence intervals, precision of, 
384-385 
errors in sampling, 339-345 
hypothesis tests, 434-436
implications of, 343-345,
385-386, 436 
p-values and, 434-436 
simple linear regression, 670-671 
Bimodal data, 108 
Binomial experiment 
example of, 244-249 
Excel used to compute, 248-249 
expected value and variance, 249-250 
overview of, 242-243 
Binomial probability distribution 
binomial experiment, 242-243 
example of, 244-249 
Excel used to compute, 248-249 
expected value and variance, 
249-250 
multinomial probability 
distribution, 519, 540 
normal distribution and, 333 
Binomial probability function, 243, 
246-247, 263 
Bivariate probability distribution, 
233-236 
Blocking, 575-576, 579 
Blocks, 576-579 
Bloomberg, 10 
Bloomberg Businessweek, 2 
Bonferroni adjustment, 573 
Boxplot, 142 
comparative analysis, using Excel 
to construct, 140-141 
Excel used to create, 139-140 
overview of, 138 
Burke Inc., 552 
Business applications, 3-4
BusinessWeek, 2 


C 


Cash flow management, 104 
Categorical data, 7, 37 
Categorical independent variables 
example of, 710-713 
interpreting the parameters, 
712-713 
more complex categorical variables, 
713-715 
Categorical variable, 7 
applications, 44-47
bar charts, 40-47
frequency distribution, 37-38
methods, 44
percent frequency distribution, 38-40
pie charts, 40-47
relative frequency distribution, 38-40
summarizing data for, 37-47
Cause-and-effect relationships, 610, 636 
Census, 15 
Center for Drug Evaluation and 
Research (CDER), 446 
Central limit theorem, 324-325, 411 
Chebyshev’s theorem, 132-133, 135 
Chi-square (χ²) distribution
hypothesis testing and, 497-498
interval estimation and, 491-495
testing for equality of population 
proportions, 537 
test of independence and, 527-529 
test statistics, 521, 530 
Chi-square test statistic 
equality of population proportions, 
testing for, 534, 536-537 
hypothesis tests, 540 
test of independence and test for 
equality, 544 
Cincinnati Zoo and Botanical Garden, 
88-90 
Classical method for assigning 
probabilities, 178, 185, 221 
Class limits, 48-49, 59 
Class midpoint, 49 
Class width, 48 
Cluster sampling, 306, 314, 337-339 
Coefficient of determination, 629, 673 
correlation coefficient, 626 
Excel used to compute, 625 
multiple regression, 698-700 
relationship among SST, SSR, and 
SSE, 624 
sum of squares due to error (SSE), 
621-623 
sum of squares due to regression 
(SSR), 624 
total sum of squares (SST), 622 
Coefficient of variation, 126-127, 159 
Coefficients, interpretation of, 693 
Colgate-Palmolive Company, 36 
Column chart, 41 
Combinations, 176-177, 181 
Comparisons, displays for, 86 
Comparisonwise Type I error rate, 
572-573 
Complements, 187-188 
Complete block design, 579 
Completely randomized design 
ANOVA table for, 562-566 
between-treatments estimate of 
population variance, 559-560 
definition of, 554 
equality of population means, 
testing for, 564-566 
Excel for, 563-564 
F test, 561-562 
introduction to, 558-559 
overview of, 554 
within-treatments estimate of 
population variance, 560 


Index 


Conditional probability, 
193-201 
formula, 208 
independent events, 196 
multiplication law, 196-197 
Confidence 
hypothesis testing and, 415-417 
precision of, 384-385 
usefulness of, 385-386 
Confidence coefficient, 359 
Confidence intervals, 359, 362, 369, 
372, 374, 381. See also Interval 
estimation 
estimated regression equation, 
640-641, 643, 708 
Excel used to construct, 449-451, 
458-460, 476-477, 495-496 
Fisher’s LSD procedure, 571-572 
formula, 673 
significance tests, 633-634 
Confidence levels, 359-360 
Continuous probability distributions, 
273-303 
exponential probability distribution, 
293-297 
normal probability distribution, 
279-293 
summary, 298 
uniform probability distribution, 
275-279 
Continuous random variables, 220 
Convenience sampling, 338-339, 345 
Correlation between random 
variables, 263 
Correlation coefficient, 148-149. 
See also Sample correlation 
coefficient 
coefficient of determination 
and, 626 
Excel used to compute, 151 
interpretation of, 149-151 
for two random variables, 236 
Counting rules, 174-178 
for combinations, 177, 181, 208 
for multiple-step experiments, 
174-175 
for permutations, 177-178, 208
Covariance, 144-146 
discrete probability distributions, 
233-239 
Excel used to compute, 151 
interpretation of, 146-148 
of random variables, 235, 263 
Coverage error, 340 
Critical value approach 
chi-square test and, 537 
for Marascuilo pairwise comparison 
procedure, 544 
one-tailed tests, 410 
overview of, 415 
p-value approach versus, 430 
rejection rule, 521, 529, 562, 570, 
633, 635, 684, 703-704 
t test for significance in simple 
linear regression, 633 
two-tailed tests, 412-413 
Cross-sectional data, 8-10 
Crosstabulation, 65-69 
Excel used to create, 68-69 
Cumulative frequency distribution,
55-56, 59 

Cumulative percent frequency 
distribution, 56 

Cumulative probabilities, 298 

Cumulative relative frequency 
distribution, 56 


D 


Dashboards. See Data dashboards 
Data, 5-10 
acquisition errors, 13 
big data and data mining, 21-22
categorical and quantitative, 7
collection of, 554-555
cross-sectional, 8-10
definition of, 5 
elements, variables, and 
observations, 5-7 
existing sources, 10-11 
experiments, 11 
observational study, 11-12 
scales of measurement, 5, 7 
sources of, 10-13 
summary, 24 
time and cost issues, 13 
time series, 8-10 
Data analytics, 20 
Data dashboards, 86-90, 153-156 
DATA.GOV, 11
Data mining, 21-22 
Data set, 5-7
Data visualization, 37, 85-90 
Data warehousing, 21 
Deciles, 116 
Decision analysis 
description of, 205 
Degrees of freedom 
of chi-square distribution, 492-493, 
497, 527 
definition of, 364 
error, 581 
of F distribution, 503-504 
of t distribution, 457-458, 462, 467 
of t distribution with two 
independent random 
samples, 484 
de Moivre, Abraham, 279 
Dependent events, 196 
Dependent variables, 607, 702 
Descriptive analytics, 20 
Descriptive statistics 
association between two variables, 
measures of, 144-153 
boxplot, 138-141 
data dashboards, 153-156 
distribution shape and relative 
location, measures of, 130-135 
Excel for, 368-369 
five-number summary, 138 
location, measures of, 105-121
numerical measures, 103-170
outliers, detecting, 134-135
overview of, 13-15
variability, measures of, 121-130
Deviation about the mean, 122 
Difference between two population 
proportions, 474-476 
Difference of population means 
hypothesis tests, 451-454, 460-463 
interval estimate, 484
interval estimation, 447-451, 
457-458 
Discrete probability distributions, 
217-271 
binomial experiment, 242-252 
bivariate probability distribution, 
233-236 
covariance, 233-239 
developing, 221-226 
expected value, 226-227 
financial applications, 236-239 
hypergeometric probability 
distribution, 257-261 
Poisson probability distribution, 
252-257 
random variables, 218-221 
required conditions for, 223 
summary, 261-262 
variance, 227-229
Discrete random variable, 219, 263 
Discrete uniform probability 
distribution, 224, 263 
Dispersion, measures of, 105, 121, 
157, 227 
Distance intervals, 254 
Distribution of data, displays for, 86 
Distribution shape, measures of, 
130-131 
Dot plots, 51-52, 86, 105 
Double-blind experimental 
design, 558 
Dow Jones & Company, 10 
Dow Jones Industrial Average 
Index, 8 
Drug efficacy testing, 446 
Dummy variables 
complex categorical variables, 
713-715 
in regression analysis, 711 
Dun & Bradstreet, 10 


E 


Economic applications, 4 
Elements of data, 5-7, 306 
Empirical discrete distribution, 222 
Empirical rule, 133-134 
Equality 
testing for the equality of k population means, 564-566
testing for three or more population proportions, 534-543
Errors in sampling, 339-345
Error term ε, 629-630, 701
Estimated multiple regression equation, 687-688, 724
Excel used to create, 691-693 
Estimated regression equation 
ANOVA output, interpretation of, 
648-649 
confidence intervals, 640-641 
for estimation and prediction, 
639-645 
Excel used to construct, 614-615 
formula, 672 
interpretation of output, 647-648
interval estimation, 640
prediction interval, 641-643
regression statistics output, interpretation of, 649
simple linear regression model, 609-610
using for estimation and prediction, 708-710
Estimated regression line, 609, 
614-615 
Estimated simple linear regression 
equation, 672 
Estimated standard deviation, 
632-633, 673 
“Ethical Guidelines for Statistical 
Practice,” 22 
Ethics for statistical practice, 22-24 
Events 
complement of, 187-188 
definition of, 183 
probabilities, 183-187
Expected frequencies, 526-527, 
529-530, 535-537, 540, 544 
Expected value (EVs), 226-227 
of p̄, 350
of x̄, 349
for binomial distribution, 
249-250, 263 
of discrete random variable, 263 
Excel used to compute, 228-229 
in financial applications, 237-238
for hypergeometric distribution, 
264 
of linear combination of random 
variables, 263 
Experiment(s), 11 
multiple, 174-177 
probability and, 173-174 
in statistics vs. physical 
sciences, 181 
Experimental design 
data collection, 554-555 
example of, 552 
factorial experiment, 584-593 
introduction to, 553-558 
randomization, 558 
randomized block design, 575-583 
summary, 593-594 
Experimental outcomes, number 
of, 263 
Experimental study, 553 
Experimental units, 554 


Experimentwise Type I error rate, 572 


Exponential distribution, 298 
Exponential probability density 
function, 293, 298 
Exponential probability 
distribution, 293 
computing, 294 
Excel used to compute, 295-296 
Poisson distribution, relationship 
between, 295 
Extreme outliers, 422 


F 


Factorial experiment, 584-585 
ANOVA procedure, 585-586 
computations and conclusions, 586-593
introduction to, 584-585 

Factorial notation (!), 177 

Factor of interest, 576 

Factors, 554. See also Variables 

F distribution, 503-509 


Federal Trade Commission 
(FTC), 406 
Financial applications, 3-4, 236-239 
Finite population, sampling from, 
308-312, 332, 375, 381 
Finite population correction factor, 
323, 381 
Fisher, Ronald Aylmer, 553 
Fisher’s least significant difference 
(LSD), 570-572, 596 
Fitch Outlook ratings, 5-7, 13-14 
Fitch Rating, 5-7 
Five-number summary, 138, 142 
Food and Agriculture Organization 
(FAO), 306 
Food Lion, 356 
Formula worksheet, 20 
Four Vs, 342 
F probability distribution, 634 
Frame, 307 
France Telecom, 21 
Frequency distribution 
categorical variable, 37-38 
crosstabulation and, 67 
example of, 36, 520 
Excel used to create, 39-40, 50-51 
observed, 526-527, 535 
quantitative variable, 47-48 
F statistic, 578 
F test 
ANOVA output, interpretation 
of, 649 
introduction to, 561-562 
in multiple regression, 722 
residual analysis, 652 
testing for significance, 634-636, 
702-704 
F test statistic, 673, 725 
F value, 563 


G 


Galton, Francis, 607 
Gauss, Carl Friedrich, 612 
General Electric (GE), 21 
Geographic Information System 
(GIS), 90 
Geometric mean, 111-113, 158 
Goodness of fit test, 519-525 
Excel used to conduct, 523 
formula, 544 
multinomial probability 
distribution, 519-525 
summary, 543 
Gosset, William Sealy, 364 
Graduate Management Admission 
Council, 11 
Graduate Management Admissions 
Test (GMAT), 584-585, 
587-590 
Graphical displays for summarizing 
data 
best practices for, 85-90 
data dashboards, 86-88 
examples of, 88-90 
scatter diagram, 76-79 
side-by-side bar chart, 79-82 
stacked bar chart, 79-82 
trendline, 76-79 
two variables, 75-84 
type of, selecting, 86 


H


High leverage points, 667 
Histogram, 52-55, 86 
bar charts and, 59 
examples of, 36, 130-131 
Excel used to create, 54-55 
Hypergeometric probability 
distribution, 257-261
Excel used to compute, 259 
Hypergeometric probability function, 
258, 264 
Hypothesis tests 
big data, 434-436
of difference of population means, 451-454, 460-463
Excel for, 413-414, 423-425, 452-454, 462-463, 469-470, 479-480, 498-500, 507-509
interval estimation and, 415-417
matched samples, 477-480
multiple regression, 722-723
null and alternative hypotheses, developing, 399-402
one-tailed tests, 405-410
population mean known, 405-420
population mean unknown, 420-428
population proportion, 428-434
population variance, 496-500 
in simple linear regression, 
670-671 
steps of, 415 
summary, 437-438 
summary and practical advice, 
414-415, 425 
two-tailed tests, 410-415 
Type I and Type II errors, 
402-405 


I


Incomplete block design, 579


Independence, tests of, 518, 525-530, 540
Excel used to conduct, 529-530 
formula, 544 
summary, 543 
Independence of trials, 243 
Independent events, 196-197, 208 
Independent random samples, 447, 
452, 475 
Independent sample design, 467, 471 
Independent variables, 705. See 
also Categorical independent 
variables 
definition of, 607 
F test, 634 
high leverage points, 667 
least squares method, 688 
multiple regression analysis, 687 
Indicator variables, 711, 713-715 
Individual significance, 702 
Inference about two populations 


population means matched samples, difference between, 467-474
population means: σ₁ and σ₂ known, difference between, 447-456
population means: σ₁ and σ₂ unknown, difference between, 456-467
population proportions, difference between, 474-482

summary, 483 

Infinite population, sampling from, 
312-314, 332, 381 

Influential observation, 665-667 

Information systems applications, 4 

Interactions, 585 

International Paper, 686 

Interquartile range (IQR), 122, 158 

Intersection of A and B, 189 

Intersection of two events, 189 

Interval estimate 

definition of, 356 

difference between two population 
proportions, 484 

margin of error, 365-369 

of population mean known, 388 

of population mean unknown, 388 

of population proportion, 388 

summary, 387 

Interval estimation, 640 
big data and, 384-386 
difference between population 
means known, 484 
difference between two population
means, 469 
of difference of population means, 
447-451, 457-458 

of difference of population means 
unknown, 484 

of difference of population 
proportions, 474-477 

hypothesis tests and, 415-417 

margin of error, 356-362 

population mean known, 356-364 

population mean unknown, 
364-374 

population proportion, 377-384 

population variances, 491-495, 511

sample size, determining, 374-377, 
380-381 

summary, 371 

Interval scale, 7 

Intranets, 4 

IQR (interquartile range), 122, 158 
IRI, 4, 10 

ith observation, 105, 610, 612, 621, 688 
ith residual, 621, 651, 655, 658, 718 


J 


John Morrell & Company, 398 
Joint probabilities, 194-195 
Judgment sampling, 339, 345 


K 


Key performance indicators (KPIs), 
87, 153 

Key performance metrics (KPMs), 87 

k population, 564-566 


L 


Leaf unit, 59 
Least squares criterion, 612, 615, 672, 
688-689, 724 




Least squares method 
coefficients, interpretation of, 693 
example of, 689-693 
multiple regression, 688-697
simple linear regression, 610-621
Length intervals, 254 
Levels of significance (alpha), 359, 
403-404 
Leverage of observation i, 674 
Leverage points, 667 
LIFO (last-in, first-out) method of 
inventory valuation, 356 
Local area networks (LANs), 4 
Location, measures of 
geometric mean, 111-113 
mean, 105-107, 116 
median, 107-108, 116
mode, 108-110 
percentiles, 113-114 
quartiles, 114-116 
weighted mean, 110 
Location of the pth percentile, 158 
Lower class limits, 48 
Lower tail test 
population mean, 405, 
407-410, 452 
population proportion, 428 


M 


MAE (mean absolute error), 127 
Marascuilo procedure, 538-539, 544 
Marginal probabilities, 194-195 
Margin of error 
interval estimate and, 365-369 
interval estimation and, 448 
introduction to, 356-357 
population mean, 356-362, 372 
population proportion, 377-378 
sample size, 375, 380-381 
Marketing applications, 4 
Market research, 398 
Matched sample(s) 
definition of, 467 
hypothesis tests, 477-480, 484 
population means, inferences about 
difference between, 467-474 
Matched sample design, 467, 471 
McDonald’s Inc., 8 
Mean, 105-107, 116-117, 131, 
280, 286 
Mean absolute error (MAE), 127 
Mean square due to error (MSE) 
ANOVA procedure, 587 
formula, 595, 673, 724 
F test, 703-704 
population variance, 560 
randomized block design, 575-576, 
578-579 
testing for significance, 631-632
Mean square due to treatments 
(MSTR), 559-560, 575, 
578-579, 595 
Mean square regression (MSR), 634, 
673, 703-704, 724 
Measurement, scales of, 5, 7 
Measurement error, 341 
Median, 107-108, 116-117, 
131, 280 
Microsoft, 3-4 
Microsoft Excel 
ANOVA procedure, 579-581,
589-591 

ANOVA table, 563-564 

bar and pie charts, 42-43 

binomial probabilities, 248-249

boxplot, 139-140 

coefficient of determination, 625 

comparative analysis using 
boxplots constructed using, 
140-141 

confidence intervals, 449-451, 
458-460, 476-477, 495-496 

covariance and correlation 
coefficient, 151 

crosstabulation, 68-69

data sets and Excel worksheets, 
17-18 

Descriptive Statistics tool, 126-127 

estimated multiple regression 
equation, 691-693 

estimated regression equation, 
displaying, 614-615 

estimated regression line, 
displaying, 614-615 

expected value, 228-229 

exponential probability distribution, 
295-296 

frequency distribution, 39-40, 
50-51 

geometric mean, 112-113 

goodness of fit test, 523 

histogram, 54-55 

hypergeometric probability 
distribution, 259 

hypothesis tests, 413-414, 
423-425, 452-454, 462-463, 
469-470, 479-480, 498-500, 
507-509 

interval estimate, 361 

interval estimation, 368-369 

interval estimation of population 
proportion, 378-380 

mean, median, and mode, 109-110 

normal probabilities, 288-290 

NORM.DIST function, 327, 
334-335, 344 

percent frequency distribution, 
39-40 

percentiles and quartiles, 115-116 

Poisson probability distribution, 
254-256 

population proportion, 430-431 

regression tool, 646-651 

relative frequency distribution, 
39-40 

residual plot, 657-660

scatter diagram, 77-79, 614-615 

side-by-side bar chart and stacked 
bar chart, 81-82 

simple random sample, 308-312 

standard deviation, 125, 228-229 

statistical analysis using, 16-19 

test of independence, 529-530 

test of multiple proportions, 
539-540 

trendline, 77-79 

variance, 125, 228-229 


Mode, 108-110, 116-117, 280 
MSE. See Mean square due to error (MSE)
MSR (Mean square regression), 634, 673, 703-704, 724
MSTR (mean square due to
treatments), 559-560, 575, 
578-579, 595 

Multicollinearity, 705-706, 722-723 

Multimodal data, 108 

Multinomial probability distribution, 
519-525 

Multiple coefficient of determination, 
698-700, 724 

Multiple comparison procedures 

Fisher’s LSD, 570-572 
Type I error rates, 572-573 
Multiple experiments, 174-177 
Multiple proportions 
equality, testing for, 534-543 
Excel used to conduct, 539-540 
Multiple regression 
big data and hypothesis testing in, 
722-723 
categorical independent variables, 
710-718 
estimated regression equation, 
using for estimation and 
prediction, 708-710 
least squares method, 688-697 
model assumptions, 700-702 
model of, 687-688, 700-702, 724 
multiple coefficient of 
determination, 698-700 
residual analysis, 718-722 
summary, 723 
testing for significance, 702-708 
Multiple regression analysis, 687 
Multiple regression equation, 
687, 724 
Multiplication law, 196-197, 208 
Mutually exclusive events, 
190-191, 197 


N


National Aeronautics and Space
Administration (NASA), 172 
Network segments, 4 
Nielsen Company, The, 4, 10 
Nominal scale, 5 
Nonprobability sampling technique, 
338, 345 
Nonresponse error, 340 
Nonsampling error, 340-341, 670 
Normal curve, 279-281 
Normal probability density function, 
280, 298 
Normal probability distribution, 333 
computing, 285-286 
example of, 286-290 
Excel used to compute, 288-290 
normal curve, 279-281 
standard normal probability 
distribution, 281-285 
Normal probability plot, 660-662 
Normal scores, 660-661 
Null hypotheses, 398-401, 406
Numerical measures 


adding to data dashboards, 153-156 


association between two variables, 
measures of, 144-153 

boxplot, 138-141 

data dashboards, 153-156 

distribution shape and relative 
location, measures of, 130-135 
five-number summary, 138
location, measures of, 105-121
outliers, detecting, 134-135 
variability, measures of, 121-130 


O


Observational study, 11-12, 553 

Observations of data, 5-7 

Observed level of significance, 409 

One-tailed tests, 401, 405-410, 
421-422, 506 

Online dating sites, 2 

Open-end class, 59 

Oracle, 22 

Ordinal scale, 5 

Outliers, 134-135, 138, 663-665 

Overall sample mean, 557, 595 

Overall significance, 702 


P 


Pairwise comparisons, 538-539, 
570, 572 
Parameters of a sampling 
population, 308 
Parameters of regression model (β),
608-610, 631-632 
Partitioned scatter diagram, 146 
Partitioning, 563 
of sum of squares, 595 
Pascal, Blaise, 172 
Pearson, Karl, 607 
Pearson product moment correlation 
coefficient, 148-149, 159 
Per Capita GDP, 5-7, 14-15 
Percent frequency distribution, 38 
categorical variable, 38-40 
Excel used to create, 39-40 
quantitative variable, 50-51 
Percentiles, 113-114
Permutations, 177-178
Pie charts, 40-47, 86 
Excel used to create, 42-43 
Pivot point, 105-106 
Planning value for σ, 375
Point estimates, 317 
Point estimation, 316-319 
Point estimators, 105, 316-317, 319, 
322, 448 
difference between two population means, 484
difference between two population proportions, 474-475, 484
estimated regression equation, 640
of population variance, 491
Poisson, Siméon, 253 
Poisson probability distribution, 218, 
252-257 
Excel used to compute, 254-256 
exponential distributions, 
relationship between, 295-296 
length or distance intervals 
example, 254 
time intervals example, 253-254 


Poisson probability function, 253, 264 


Pooled estimates, 484, 557 

Pooled estimators of population 
proportions, 478 

Pooled sample variance, 464 

Population, 15, 306 


Population correlation coefficient (ρ),
148-149, 637, 684 
Population covariance, 146, 159 
Population mean (μ)
formula, 158 
known, 405-420 
matched samples, inference about 
difference between, 467-474 
in null and alternative hypotheses, 
401 
one-tailed tests, 405-410 
population proportions, inference 
about difference between, 
474-482 
σ₁ and σ₂ known, inference about
difference between, 447-456
σ₁ and σ₂ unknown, inference
about difference between,
456-467
sampling error, 340 
small sample, 369-371 
standard deviation, 227, 280, 286 
two-tailed tests, 410-415 
unknown, 420-428
Population mean: σ known, 356-364
Excel for, 361 
margin of error and the interval 
estimate, 356-362 
practical advice, 362 
Population mean: σ unknown,
364-374 
Excel for, 368-369 
interval estimation procedures, 371 
introduction to, 364-365
margin of error and the interval 
estimate, 365-369 
practical advice, 369 
small sample, 369-371 
Population parameters 
cluster sampling, 337 
confidence intervals, 384-385 
definition of, 105 
estimated regression equation, 609 
hypothesis tests, 401, 417 
point estimator, 317 
sampling methods, 307-308 
Population proportions 
Excel for, 378-380, 430-431 


hypothesis tests, 377-378, 428-434 


in null and alternative hypotheses, 
401 
sample size, 380-381 
testing for equality of three or 
more, 534-543 
Population variances (σ²), 122-123
between-treatments estimate of, 
559-560 
formula, 159 
hypothesis testing, 496-500 
inferences about, 491-502 
inferences about two, 503-513 
interval estimation, 491-495, 511
within-treatments estimate of, 560 
Posterior probabilities, 201, 203 


ρ (Population correlation coefficient), 148-149, 637, 684
Practical significance, 436 


Prediction interval, 640-643, 673, 708 


Predictive analytics, 20 
Predictive models, 22 
Prescriptive analytics, 20 
Prior probability, 201 


Probability 
addition law, 188-190
assigning, 178-180
basic relationships of, 187-193
Bayes’ theorem, 201-206 
complement of an event, 187-188 
computing using the complement, 
208 
conditional, 193-201 
counting rules, combinations, and 
permutations, 174-178 
definition of, 172 
events and, 183-187 
example of, 179-180 
experiments, 173-174 
introduction to, 172-173 
multiple, 174-177 
relative frequency method, 222 
subjective method, 222 
summary, 206-207 
Probability density function f(x), 
275, 277 
Probability distributions, 221 
Probability functions, 222 
Probability sampling techniques, 
337-339, 345 
Probability tree, 202. See also Tree 
diagrams 
Procter & Gamble (P&G), 274 
Producer Price Index, 4 
Production applications, 4 
Protected LSD test, 571 
pth percentile, 113 
p-value, 417, 434-436, 497-499, 521,
528, 561-563, 633, 722 
p-value approach 
chi-square test, 527, 537 
critical value approach versus, 430 
hypothesis tests, 496, 506 
one-tailed tests, 407-408 
rejection rule, 521, 529, 562, 570, 
633, 635, 684, 703-704 
t test for significance in simple 
linear regression, 633 
two-tailed tests, 411-412, 415 


Q 


Quality control, 4 
Quantitative data, 7, 37, 59 
Quantitative variable, 7 
cumulative distributions, 55-56 
dot plot, 51-52 
frequency distribution, 47-48 
histogram, 52-55 
percent frequency distribution, 
49-51 
relative frequency distribution, 
49-51 
stem-and-leaf display, 56-59 
summarizing data for, 47-65
Quartiles, 114-116, 134 
Quintiles, 116 


R 


Random experiments, 181 
Randomization, 554, 558 
Randomized block design 
air traffic controller stress test, 
576-577 
ANOVA procedure, 577-578
computations and conclusions, 
578-581 
Random numbers, 309 
Random sample, 306, 313-314, 340, 


standard deviation of, 323-324 
for treatment j, 595 

Sample points, 173, 176 

Sample proportion, 331-337 
expected value of, 332 


of ith residual, 674 

as measure of risk, 127 
population, 124, 227, 286, 464 
of sample mean, 323-324, 349 
of sample proportion, 333, 350 


to show relationships, 86 

in time series analysis, 82 
Selection bias, 313-314 
Side-by-side bar chart 

description of, 79-82, 85-86, 88 


454, 457 introduction to, 319-321 Excel used to create, 81-82 Standard error 
Random variables, 218—221, 263, 277 practical value of, 333-335 Significance, levels of, 359, 403-404 confidence intervals, 385 
Range, 122 sampling distribution of, 333 Significance tests, 403 formula, 349-350, 484 


Ratio scale of measurement, 7 
Regression equation, 607—609 
estimated, linear, 609, 
614-615, 672 
estimated, multiple, 687—688, 
691-693, 724 
Regression models, 607-610 


standard deviation of, 333 


Sample size 


determining, 374-377, 380-381 

hypothesis tests, 454 

interval estimate and, 362 

for interval estimate of population 
mean, 388 


confidence intervals, 633-634 

F test, 634-636 

interpretation of, 636-637 

mean square error, 631—632 
multiple regression, 702-708 
simple linear regression, 631—639 
t test, 632-633 


hypothesis tests, 478 
interval estimation, 448, 475 
sampling and nonsampling error, 
339-341 
Standard error of the estimate, 
632, 673 
Standard error of the mean, 323, 


multiple, 687—688, 700-702, 724 
simple linear, 607-610, 672 


328, 332 
Standard error of the proportion, 333 


for interval estimate of population Simple linear regression 


proportion, 388 big data and hypothesis testing in, 


Regression statistics output, in multiple regression, 722 670-671 Standardized residuals, 655—660, 665, 
interpretation of, 649 sampling distribution and, 327-328 coefficient of determination, 674, 718-720 
Rejection rule, 408-410, 521, 529, in simple linear regression, 670 621-629 Standardized value, 132, 134 


537, 562, 570-571, 633, 635, 
684, 703-704 
Relationships, displays for, 86 
Relative frequency, 38, 92 


Sample space, 173-174, 185 
Sample statistics, 105, 316 
Sample surveys, 15 


definition of, 607 

equation, 672 

estimated regression equation used 
for estimation and prediction, 


Standard normal density 
function, 281 
Standard normal probability 


Sample variances distribution, 281-285, 540 


Relative frequency distribution 
categorical variable, 38-40 
crosstabulation and, 67 
definition of, 38 
example of, 130-131 
Excel used to create, 39-40 
quantitative variable, 49-51 

Relative frequency method, 

178-179, 222 

Relative location, 130 

Replication, 554, 585 

Rescue methods, 172 

Research hypothesis, 398—400 

Residual analysis, 651—663 
introduction to, 651—652 
multiple regression, 718-722 
normal probability plot, 660—662 
residual plot, 652-655 
standardized residuals, 655-660 

Residual for observation i, 

651, 673 
Residual plots, 652-655, 657-660, 
662, 718-720 

Response surface, 702 

Response variables, 554, 556, 702 

Restricted LSD test, 572 

Return on equity (ROE), 11-12 

Rounding errors, 127 
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Sample, 15, 306 
Sample correlation coefficient, 148, 


Excel for, 125 

formula, 127, 159 

interval estimation, 491 

pooled sample variance, 464 

population variances and, 
122-123, 491 

standard deviation, 124-125 

for treatment j, 595 


Sampling 


big data and, 339-345 

errors in, 339-345 

example problem, 307-308 

from finite population, 308-312 
from infinite population, 312-314 
introduction to, 306-307 

point estimation, 316-319 

sample selection, 308-315 
summary, 348 


Sampling distributions 


big data and errors in sampling, 
339-345 

definition of, 321 

form of, 324-325 

inferences about two population 
variances, 503 

introduction to, 319-321 

judgment sampling, 339 

other sampling methods, 337-339 

of population variance, 491-492 

practical value of, 325-327 

of sample mean, 322-331 

sample proportion, 331-337 

sample size and, 327-328 

standard deviation of, 323-324 


639-645 
estimation process, 609 
influential observation, 665-667 
least squares method, 610-621 
model assumptions, 629-631 
model of, 607-610, 672 
outliers, 663-665 
residual analysis, 651-663 
summary, 671 
testing for significance, 631-639 
Simple random sample, 308-312, 
314, 319-320, 339, 358 
Simpson’s paradox, 69-71 
Single-factor experiment, 554 


Skewness of distributions, 130-131, 


296, 372 

σ known, 356-364, 372, 405-420 

Slope, 612-614, 672 

Small Fry Design, 104 

Social media, 21, 342 

Spearman rank-correlation 
coefficient, 152 

SSA. See Sum of squares for factor 
A (SSA) 

SSAB. See Sum of squares for 
interaction (SSAB) 

SSB. See Sum of squares for factor 
B (SSB) 

SSBL. See Sum of squares due to 
blocks (SSBL) 


SSE. See Sum of squares due to error 


(SSE) 
SSR. See Sum of squares due to 
regression (SSR) 


Standard normal random variable, 
286, 298 
Stationarity assumption, 243 
Statistical analysis using Microsoft 
Excel, 16-19 
Statistical inference, 15-16, 
317, 345 
Statistical significance, practical 
significance versus, 637, 670 
Statistics 
business applications of, 3-4 
definition of, 3 
descriptive. See Descriptive 
statistics 
economics applications 
of, 3-4 
ethical guidelines for, 22-24 
introduction to, 2-3 
in practice, 2 
summary, 24 
Statistics in practice 
Bloomberg Businessweek, 2 
Burke Inc., 552 
Colgate-Palmolive Company, 36 
Food and Agriculture 
Organization, 306 
Food Lion, 356 
International Paper, 686 
John Morrell & Company, 398 
National Aeronautics and Space 
Administration (NASA), 172 
Procter & Gamble, 274 
Small Fry Design, 104 
United Way, 518 


151, 673 
summary, 348 
SST. See Sum of squares total (SST) 
U.S. Food and Drug 
Microsoft® Office Excel® Functions 


Function Description 

AVERAGE Returns the arithmetic mean of its arguments. 

BINOM.DIST Returns the individual term binomial distribution probability. 

CHISQ.DIST Returns a probability from the chi-squared distribution. 

CHISQ.DIST.RT Returns the right-tailed probability of the chi-squared distribution. 

CHISQ.INV Returns the inverse of the left-tailed probability of the chi-squared distribution. 

CHISQ.TEST Returns the value from the chi-squared distribution for the statistic and the degrees 
of freedom. 

CONFIDENCE.NORM Returns the margin of error of a confidence interval for a population mean using the normal distribution. 

CORREL Returns the correlation coefficient between two data sets. 

COUNT Returns the number of cells in the range that contain numbers. 

COUNTA Returns the number of non-blank cells in the range. 

COUNTIF Returns the number of cells in a range that meet the specified criterion. 

COVARIANCE.S Returns the sample covariance. 

EXPON.DIST Returns a probability from the exponential distribution. 

F.DIST.RT Returns the right-tailed probability from the F distribution. 

GEOMEAN Returns the geometric mean of a range of cells. 

HYPGEOM.DIST Returns a probability from the hypergeometric distribution. 

MAX Returns the maximum value of the values in a range of cells. 

MEDIAN Returns the median value of the values in a range of cells. 

MIN Returns the minimum value of the values in a range of cells. 

MODE.SNGL Returns the most frequently occurring value in a range of cells. 

NORM.S.DIST Returns a probability from a standard normal distribution. 

NORM.S.INV Returns the inverse of the standard normal cumulative distribution. 

PERCENTILE.EXC Returns the specified percentile of the values in a range of cells. 

POISSON.DIST Returns a probability from the Poisson distribution. 

POWER Returns the result of a number raised to a power. 

QUARTILE.EXC Returns the specified quartile of the values in a range of cells. 

RAND Returns a real number from the uniform distribution between 0 and 1. 

SQRT Returns the positive square root of its argument. 

STDEV.S Returns the sample standard deviation of the values in a range of cells. 

SUM Returns the sum of the values in a range of cells. 

SUMPRODUCT Returns the sum of the products of the paired elements of the values in two ranges of cells. 

T.DIST Returns a left-tailed probability of the t distribution. 

T.INV.2T Returns the two-tailed inverse of the Student's t distribution. 

VAR.S Returns the sample variance of the values in a range of cells. 
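
For readers who want to check a worksheet result outside Excel, the following is a minimal Python sketch of what several of the functions listed above compute. It is an illustration only, not Excel's implementation. The data values and the helper names (binom_pmf, poisson_pmf, expon_cdf, confidence_norm) are hypothetical, and each helper corresponds to just one setting of the matching Excel function: cumulative = FALSE for BINOM.DIST and POISSON.DIST, and cumulative = TRUE for EXPON.DIST and NORM.S.DIST.

```python
import math
from statistics import NormalDist, geometric_mean, mean, median, stdev, variance

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8]

# Descriptive statistics; the keys name the Excel function each value mirrors.
summary = {
    "AVERAGE": mean(data),
    "MEDIAN": median(data),
    "GEOMEAN": geometric_mean(data),
    "VAR.S": variance(data),   # sample variance (divides by n - 1)
    "STDEV.S": stdev(data),    # sample standard deviation
    "MIN": min(data),
    "MAX": max(data),
    "SUM": sum(data),
}


def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials (BINOM.DIST, cumulative=FALSE)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences with mean mu (POISSON.DIST, cumulative=FALSE)."""
    return math.exp(-mu) * mu**x / math.factorial(x)


def expon_cdf(x, lam):
    """Cumulative exponential probability with rate lam (EXPON.DIST, cumulative=TRUE)."""
    return 1 - math.exp(-lam * x)


z = NormalDist()                # standard normal: mean 0, standard deviation 1
norm_s_dist = z.cdf(1.96)       # NORM.S.DIST(1.96, TRUE)
norm_s_inv = z.inv_cdf(0.975)   # NORM.S.INV(0.975)


def confidence_norm(alpha, sigma, n):
    """Margin of error for a population mean, sigma known (CONFIDENCE.NORM)."""
    return z.inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)


if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name:10s} {value:.4f}")
    print(f"Margin of error: {confidence_norm(0.05, 10, 100):.4f}")
```

The interval estimate itself is the sample mean plus and minus the value returned by confidence_norm, which is why the description of CONFIDENCE.NORM is best read as a margin of error rather than a complete interval.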